Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model's prediction at t-1. To leverage videos' temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this the Residual Flow Diffusion Model (RFDM); it focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found at: this https URL
https://arxiv.org/abs/2602.06871
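The residual forward process described above can be illustrated with a minimal flow-matching-style sketch. This is a hypothetical construction for intuition only; the function name `residual_flow_training_pair` and the rectified-flow interpolation are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_flow_training_pair(prev_pred, target, t, rng):
    """Build one training pair for a residual-flow forward process.

    The model is trained on the residual between the target frame and
    the previous frame's prediction, so static regions contribute
    near-zero signal and denoising focuses on what changed.
    """
    residual = target - prev_pred
    noise = rng.standard_normal(target.shape)
    x_t = (1.0 - t) * noise + t * residual   # rectified-flow interpolation
    velocity = residual - noise              # regression target for the model
    return x_t, velocity

# Toy frames: mostly static content with one small local change.
prev_pred = np.ones((8, 8))
target = prev_pred.copy()
target[2:4, 2:4] += 0.5

x_t, v = residual_flow_training_pair(prev_pred, target, t=0.3, rng=rng)
print(x_t.shape, v.shape)
```

At t=1 the noisy state reduces to the pure residual, which is zero outside the changed patch; this is the sense in which the formulation concentrates denoising on frame-to-frame changes.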
Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce \textbf{WorldEdit}, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide \textbf{WorldEdit-Test} for evaluating existing models' performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, combined with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.
https://arxiv.org/abs/2602.07095
Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities. We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior. Building on these insights, we introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.
https://arxiv.org/abs/2602.05013
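The frequency decomposition analyzed above can be made concrete with a toy numpy sketch. The per-band rotary inner product below is standard RoPE algebra; the `rope_score` helper and the specific damping profile `gain` are invented here for illustration and are not the paper's method.

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    """Per-pair RoPE angular frequencies: theta_i = base^(-2i/dim)."""
    i = np.arange(dim // 2)
    return base ** (-2.0 * i / dim)

def rope_score(q, k, pos_q, pos_k, freqs, band_gain=None):
    """Attention logit between RoPE-rotated q and k, band by band.

    The rotary inner product depends only on the relative offset
    (pos_q - pos_k) within each frequency band; band_gain optionally
    reweights the bands before summing.
    """
    delta = pos_q - pos_k
    qe, qo = q[0::2], q[1::2]
    ke, ko = k[0::2], k[1::2]
    band = (np.cos(delta * freqs) * (qe * ke + qo * ko)
            + np.sin(delta * freqs) * (qe * ko - qo * ke))
    if band_gain is not None:
        band = band * band_gain
    return float(band.sum())

dim = 64
freqs = rope_freqs(dim)
rng = np.random.default_rng(0)
q = rng.standard_normal(dim)
k = q.copy()  # identical content placed at a different position

aligned = rope_score(q, k, 5, 5, freqs)    # spatially aligned reference token
shifted = rope_score(q, k, 5, 37, freqs)   # same content, offset by 32

# Damp the higher-frequency half of the bands (a made-up gain profile).
gain = np.where(freqs > np.median(freqs), 0.1, 1.0)
shifted_damped = rope_score(q, k, 5, 37, freqs, band_gain=gain)
print(aligned, shifted, shifted_damped)
```

Even with identical content, the spatially aligned pair scores higher than the offset pair, because the high-frequency cosine terms oscillate away from 1 under a positional offset; this is the copying bias the paper attributes to high-frequency bands, and band-wise reweighting is one way to soften it.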
Semantic segmentation networks require large amounts of pixel-level annotated data, which are costly to obtain for real-world images. Computer graphics engines can generate synthetic images alongside their ground-truth annotations. However, models trained on such images can perform poorly on real images due to the domain gap between real and synthetic images. Style transfer methods can reduce this difference by applying a realistic style to synthetic images. Choosing effective data transformations and their sequence is difficult due to the large combinatorial search space of style transfer operators. Using multi-objective genetic algorithms, we optimize pipelines to balance structural coherence and style similarity to target domains. We study the use of paired-image metrics on individual image samples during evolution to enable rapid pipeline evaluation, as opposed to standard distributional metrics that require the generation of many images. After optimization, we evaluate the resulting Pareto front using distributional metrics and segmentation performance. We apply this approach to standard datasets in synthetic-to-real domain adaptation: from the video game GTA5 to real image datasets Cityscapes and ACDC, focusing on adverse conditions. Results demonstrate that evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives. The contribution of this work is the formulation of style transfer as a sequencing problem suitable for evolutionary optimization and the study of efficient metrics that enable feasible search in this space. The source code is available at: this https URL.
https://arxiv.org/abs/2602.03625
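The sequencing formulation can be sketched as a tiny mutate-and-select loop over operator lists with a Pareto filter. Everything here is a stand-in: the operator names are hypothetical, and `evaluate` returns random scores where the paper would apply paired-image metrics to transformed samples.

```python
import random

# Hypothetical operator vocabulary; a real pipeline would wrap actual
# image transforms (color transfer, blur, grain, contrast, ...).
OPS = ["color_transfer", "gaussian_blur", "film_grain", "clahe", "gamma"]

def random_pipeline(rng, max_len=4):
    return [rng.choice(OPS) for _ in range(rng.randint(1, max_len))]

def mutate(pipe, rng):
    pipe = list(pipe)
    pipe[rng.randrange(len(pipe))] = rng.choice(OPS)
    return pipe

def evaluate(pipe, rng):
    """Stand-in objectives (structural coherence, style similarity).
    The paper scores transformed samples with paired-image metrics;
    random scores keep the sketch self-contained."""
    return (rng.random(), rng.random())

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(pop):
    return [p for p in pop
            if not any(dominates(q["fit"], p["fit"]) for q in pop)]

rng = random.Random(0)
pop = [{"pipe": random_pipeline(rng)} for _ in range(12)]
for ind in pop:
    ind["fit"] = evaluate(ind["pipe"], rng)
for _ in range(20):  # mutate-and-select generations
    child = {"pipe": mutate(rng.choice(pop)["pipe"], rng)}
    child["fit"] = evaluate(child["pipe"], rng)
    pop.append(child)
    pop = pareto_front(pop)
print(len(pop), pop[0]["pipe"])
```

The surviving population is mutually non-dominated, i.e. an approximation of the Pareto front the paper evaluates with distributional metrics after optimization.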
Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model. We propose a novel method for effective disentanglement of the style and content in human motion data to facilitate style transfer. Our approach is guided by the insight that content corresponds to coarse motion attributes while style captures the finer, expressive details. To model this hierarchy, we employ Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn a coarse-to-fine representation of motion. We further enhance the disentanglement by integrating contrastive learning and a novel information leakage loss with codebook learning to organize the content and the style across different codebooks. We harness this disentangled representation using our simple and effective inference-time technique Quantized Code Swapping, which enables motion style transfer without requiring any fine-tuning for unseen styles. Our framework demonstrates strong versatility across multiple inference applications, including style transfer, style removal, and motion blending.
https://arxiv.org/abs/2602.02334
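Quantized Code Swapping can be sketched with a toy two-level residual quantizer. This collapses the paper's multi-level, contrastively organized RVQ into two levels and invents the codebook sizes; the coarse level stands in for content and the residual level for style.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-level RVQ: level-0 codebook treated as content,
# level-1 (residual) codebook treated as style.
content_cb = rng.standard_normal((16, 4))
style_cb = rng.standard_normal((16, 4))

def rvq_encode(x):
    """Greedy residual quantization: nearest code per level."""
    i0 = int(np.argmin(((content_cb - x) ** 2).sum(-1)))
    residual = x - content_cb[i0]
    i1 = int(np.argmin(((style_cb - residual) ** 2).sum(-1)))
    return i0, i1

def rvq_decode(i0, i1):
    return content_cb[i0] + style_cb[i1]

x_content = rng.standard_normal(4)  # e.g. a "walk" motion feature
x_style = rng.standard_normal(4)    # e.g. an "angry" styled feature

c0, _ = rvq_encode(x_content)
_, s1 = rvq_encode(x_style)

# Quantized Code Swapping: keep the content's coarse code and take the
# style reference's residual code; no fine-tuning needed at inference.
transferred = rvq_decode(c0, s1)
print(transferred)
```

The swap happens purely in index space, which is why unseen styles need no fine-tuning: any style reference can be encoded and its residual codes substituted at decode time.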
3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed \textbf{Tune-Your-Style}, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized views and zero-style guidance from the initial rendering. Extensive experiments demonstrate that our method not only delivers visually appealing results, but also exhibits flexible customizability for 3D style transfer. Project page is available at this https URL.
https://arxiv.org/abs/2602.00618
Whereas reinforcement learning has been applied with success to a range of robotic control problems in complex, uncertain environments, reliance on extensive data (typically sourced from simulation environments) limits real-world deployment due to the domain gap between simulated and physical systems, coupled with limited real-world sample availability. We propose a novel method for sim-to-real transfer of reinforcement learning policies, based on a reinterpretation of neural style transfer from image processing to synthesise novel training data from unpaired unlabelled real world datasets. We employ a variational autoencoder to jointly learn self-supervised feature representations for style transfer and generate weakly paired source-target trajectories to improve physical realism of synthesised trajectories. We demonstrate the application of our approach based on the case study of robot cutting of unknown materials. Compared to baseline methods, including our previous work, CycleGAN, and conditional variational autoencoder-based time series translation, our approach achieves improved task completion time and behavioural stability with minimal real-world data. Our framework demonstrates robustness to geometric and material variation, and highlights the feasibility of policy adaptation in challenging contact-rich tasks where real-world reward information is unavailable.
https://arxiv.org/abs/2601.20846
Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model's robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset of distinct specific styles and further synthesized triplets using thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising precise content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at this https URL
https://arxiv.org/abs/2601.20175
3D style transfer enables the creation of visually expressive 3D content, enriching the visual appearance of 3D scenes and objects. However, existing VGG- and CLIP-based methods struggle to model multi-view consistency within the model itself, while diffusion-based approaches can capture such consistency but rely on denoising directions, leading to unstable training. To address these limitations, we propose DiffStyle3D, a novel diffusion-based paradigm for 3DGS style transfer that directly optimizes in the latent space. Specifically, we introduce an Attention-Aware Loss that performs style transfer by aligning style features in the self-attention space, while preserving original content through content feature alignment. Inspired by the geometric invariance of 3D stylization, we propose a Geometry-Guided Multi-View Consistency method that integrates geometric information into self-attention to enable cross-view correspondence modeling. Based on geometric information, we additionally construct a geometry-aware mask to prevent redundant optimization in overlapping regions across views, which further improves multi-view consistency. Extensive experiments show that DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism.
https://arxiv.org/abs/2601.19717
We present Quran MD, a comprehensive multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. For each verse (ayah), the dataset provides its original Arabic text, English translation, and phonetic transliteration. To capture the rich oral tradition of Quranic recitation, we include verse-level audio from 32 distinct reciters, reflecting diverse recitation styles and dialectical nuances. At the word level, each token is paired with its corresponding Arabic script, English translation, transliteration, and an aligned audio recording, allowing fine-grained analysis of pronunciation, phonology, and semantic context. This dataset supports various applications, including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies. Bridging text and audio modalities across multiple reciters, this dataset provides a unique resource to advance computational approaches to Quranic recitation and study. Beyond enabling tasks such as ASR, tajweed detection, and Quranic TTS, it lays the foundation for multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems that can support both research and community applications. The dataset is available at this https URL
https://arxiv.org/abs/2601.17880
Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness, but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but fall short in terms of style diversity or by introducing artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in the context of histopathology and dermatology. Results demonstrate an improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test-time augmentation. The source code is available at this https URL .
https://arxiv.org/abs/2601.17586
Recent advances in image editing leverage latent diffusion models (LDMs) for versatile, text-prompt-driven edits across diverse tasks. Yet maintaining pixel-level edge structures (crucial for tasks such as photorealistic style transfer or image tone adjustment) remains a challenge for latent-diffusion-based editing. To overcome this limitation, we propose a novel Structure Preservation Loss (SPL) that leverages local linear models to quantify structural differences between input and edited images. Our training-free approach integrates SPL directly into the diffusion model's generative process to ensure structural fidelity. This core mechanism is complemented by a post-processing step to mitigate LDM decoding distortions, a masking strategy for precise edit localization, and a color preservation loss to preserve hues in unedited areas. Experiments confirm SPL enhances structural fidelity, delivering state-of-the-art performance in latent-diffusion-based image editing. Our code will be publicly released at this https URL.
https://arxiv.org/abs/2601.16645
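The idea of measuring structure with local linear models can be sketched as a guided-filter-style fit: per window, find the best linear map from input to edit and score the residual. `structure_loss` below is a simplified stand-in, assuming grayscale images and a box-filter implementation; it is not the paper's exact SPL.

```python
import numpy as np

def box_mean(img, r):
    """Mean over (2r+1)^2 windows via an integral image (edges cropped)."""
    k = 2 * r + 1
    c = np.cumsum(np.cumsum(img, 0), 1)
    c = np.pad(c, ((1, 0), (1, 0)))
    return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / (k * k)

def structure_loss(inp, edit, r=2, eps=1e-4):
    """Residual of the best per-window linear fit edit ~ a*inp + b.

    A small residual means the edit is locally a linear tone change of
    the input, i.e. edge structure is preserved. This guided-filter-style
    local linear model is a simplified stand-in for the paper's SPL.
    """
    mi, me = box_mean(inp, r), box_mean(edit, r)
    cov = box_mean(inp * edit, r) - mi * me
    var = box_mean(inp * inp, r) - mi * mi
    a = cov / (var + eps)
    b = me - a * mi
    ci = inp[r:inp.shape[0] - r, r:inp.shape[1] - r]   # window centers
    ce = edit[r:edit.shape[0] - r, r:edit.shape[1] - r]
    return float(np.mean((ce - (a * ci + b)) ** 2))

rng = np.random.default_rng(0)
inp = rng.random((32, 32))
tone = 0.5 * inp + 0.2            # pure tone curve: structure intact
scrambled = rng.random((32, 32))  # unrelated image: structure lost
print(structure_loss(inp, tone), structure_loss(inp, scrambled))
```

A global tone curve yields a near-zero loss because every window admits an exact linear fit, while an unrelated image cannot be explained locally and scores high; this is the sense in which the loss isolates edge structure from tone.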
Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.
https://arxiv.org/abs/2601.14124
Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on the generative adversarial network (GAN) or stable diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GANs and propose a Mamba-based generator, termed StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, we introduce a Mamba-based generator with a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.
https://arxiv.org/abs/2601.12954
In 1888, Vincent van Gogh wrote, "I am seeking exaggeration in the essential." This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints. Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.
https://arxiv.org/abs/2601.10075
Toxic language is one of the major barriers to safe online participation, yet robust mitigation tools are scarce for African languages. This study addresses this critical gap by investigating automatic text detoxification (toxic-to-neutral rewriting) for two low-resource African languages, isiXhosa and Yorùbá. The work contributes a novel, pragmatic hybrid methodology: a lightweight, interpretable TF-IDF and Logistic Regression model for transparent toxicity detection, and a controlled lexicon- and token-guided rewriting component. A parallel corpus of toxic-to-neutral rewrites, which captures idiomatic usage, diacritics, and code switching, was developed to train and evaluate the model. The detection component achieved stratified K-fold accuracies of 61-72% (isiXhosa) and 72-86% (Yorùbá), with per-language ROC-AUCs up to 0.88. The rewriting component successfully detoxified all detected toxic sentences while preserving 100% of non-toxic sentences. These results demonstrate that scalable, interpretable machine learning detectors combined with rule-based edits offer a competitive and resource-efficient solution for culturally adaptive safety tooling, setting a new benchmark for low-resource Text Style Transfer (TST) in African languages.
https://arxiv.org/abs/2601.05624
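The detect-then-rewrite pipeline can be sketched in a few lines. The paper trains a TF-IDF + Logistic Regression detector on isiXhosa/Yorùbá text; here a toy lexicon check stands in for the classifier, and the toxic-to-neutral mapping uses invented English placeholder entries.

```python
# Invented placeholder lexicon; a real system would use a curated
# isiXhosa or Yorùbá toxic->neutral mapping.
TOXIC_LEXICON = {"idiot": "person", "stupid": "questionable", "trash": "poor"}

def detect_toxic(sentence):
    """Stand-in for the TF-IDF + LR detector: flag on any lexicon hit."""
    return any(tok.lower().strip(".,!?") in TOXIC_LEXICON
               for tok in sentence.split())

def detoxify(sentence):
    """Token-guided rewrite: replace lexicon hits, keep everything else
    (including trailing punctuation) untouched."""
    out = []
    for tok in sentence.split():
        core = tok.lower().strip(".,!?")
        if core in TOXIC_LEXICON:
            tail = tok[len(core):] if tok.lower().startswith(core) else ""
            out.append(TOXIC_LEXICON[core] + tail)
        else:
            out.append(tok)
    return " ".join(out)

s = "That idea is stupid, honestly."
if detect_toxic(s):
    s = detoxify(s)
print(s)  # -> "That idea is questionable, honestly."
```

Because the rewrite only fires on detected tokens, non-toxic sentences pass through byte-for-byte, which mirrors the paper's 100% preservation of non-toxic input.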
Content-preserving style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to the entanglement of content and style features in their internal representations. In this technical report, we propose the first content-preserving style transfer model trained on Qwen-Image-Edit, activating Qwen-Image-Edit's strong content-preservation and style-customization capabilities. We collected and filtered high-quality data covering a limited set of specific styles and synthesized triplets spanning thousands of categories of in-the-wild style images. We introduce a Curriculum Continual Learning framework to train QwenStyle on this mixture of clean and noisy triplets, which enables QwenStyle to generalize to unseen styles without degrading its precise content-preservation capability. Our QwenStyle V1 achieves state-of-the-art performance on three core metrics: style similarity, content consistency, and aesthetic quality.
https://arxiv.org/abs/2601.06202
Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input can thus be well preserved. The extensive experiments on Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-art methods.
https://arxiv.org/abs/2601.04520
Text-to-Image editing using diffusion models faces challenges in balancing content preservation with edit application and handling real-image editing. To address these, we propose LAMS-Edit, leveraging intermediate states from the inversion process (an essential step in real-image editing) during edited image generation. Specifically, latent representations and attention maps from both processes are combined at each step using weighted interpolation, controlled by a scheduler. This technique, Latent and Attention Mixing with Schedulers (LAMS), integrates with Prompt-to-Prompt (P2P) to form LAMS-Edit, an extensible framework that supports precise editing with region masks and enables style transfer via LoRA. Extensive experiments demonstrate that LAMS-Edit effectively balances content preservation and edit application.
https://arxiv.org/abs/2601.02987
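The per-step weighted interpolation at the heart of LAMS can be sketched with arrays standing in for latents and attention maps. The linear-decay `lams_scheduler` is a made-up schedule for illustration; the paper's schedulers may take other shapes.

```python
import numpy as np

def lams_scheduler(step, total, tau=0.5):
    """Hypothetical mixing schedule: lean on inversion states early
    (to keep structure), fade to pure generation later (to apply the edit)."""
    s = step / max(total - 1, 1)
    return float(np.clip((tau - s) / tau, 0.0, 1.0))

def lams_mix(gen_latent, inv_latent, gen_attn, inv_attn, w):
    """Weighted interpolation of latents and attention maps (LAMS core)."""
    latent = (1 - w) * gen_latent + w * inv_latent
    attn = (1 - w) * gen_attn + w * inv_attn
    return latent, attn

T = 10
rng = np.random.default_rng(0)
gen_z, inv_z = rng.standard_normal((2, 4, 8, 8))   # toy latents
gen_a, inv_a = rng.random((2, 16, 16))             # toy attention maps
for step in range(T):
    w = lams_scheduler(step, T)
    gen_z, gen_a = lams_mix(gen_z, inv_z, gen_a, inv_a, w)
print(lams_scheduler(0, T), lams_scheduler(T - 1, T))  # 1.0 0.0
```

With this schedule, early steps are pulled entirely toward the inversion trajectory (w = 1) and late steps run unmodified generation (w = 0), which is one way to trade content preservation against edit strength.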
Unsupervised Text Style Transfer (UTST) aims to build a system to transfer the stylistic properties of a given text without parallel text pairs. Compared with text transfer between style polarities, UTST for controllable intensity is more challenging due to the subtle differences in stylistic features across different intensity levels. Faced with the challenges posed by the lack of parallel data and the indistinguishability between adjacent intensity levels, we propose a SFT-then-PPO paradigm to fine-tune an LLM. We first fine-tune the LLM with synthesized parallel data. Then, we further train the LLM with PPO, where the rewards are elaborately designed for distinguishing the stylistic intensity in hierarchical levels. Both the global and local stylistic features are considered to formulate the reward functions. The experiments on two UTST benchmarks showcase that both rewards have their advantages and applying them to LLM fine-tuning can effectively improve the performance of an LLM backbone based on various evaluation metrics. Even for close levels of intensity, we can still observe noticeable stylistic differences between the generated texts.
https://arxiv.org/abs/2601.01060