Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) a semantic gap: the style reference may lack appropriate content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks), which restricts applicability; and 3) rigid feature associations that lack adaptive global-local alignment and fail to balance fine-grained stylization with global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address them, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy-function-guided diffusion sampling with a regional style loss). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
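A minimal Python sketch of how the first two stages could look, assuming per-pixel diffusion features, k-means clustering, and centroid cosine similarity for region matching; the shapes, clustering choice, and matching rule are illustrative assumptions, not StyleGallery's exact procedure:

```python
# Hypothetical sketch of the segmentation and region-matching stages: cluster
# per-pixel diffusion features into regions, then match content regions to
# style regions by centroid similarity.
import numpy as np
from sklearn.cluster import KMeans

def segment_regions(feats: np.ndarray, n_regions: int = 5) -> np.ndarray:
    """Cluster an (H, W, C) diffusion feature map into region labels (H, W)."""
    h, w, c = feats.shape
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(feats.reshape(-1, c))
    return labels.reshape(h, w)

def match_regions(content_feats, content_labels, style_feats, style_labels):
    """Match each content region to the style region with the closest centroid."""
    def centroids(feats, labels):
        return np.stack([feats[labels == r].mean(axis=0) for r in np.unique(labels)])
    c_cent = centroids(content_feats.reshape(-1, content_feats.shape[-1]), content_labels.ravel())
    s_cent = centroids(style_feats.reshape(-1, style_feats.shape[-1]), style_labels.ravel())
    # Cosine similarity between every content/style centroid pair.
    c_n = c_cent / np.linalg.norm(c_cent, axis=1, keepdims=True)
    s_n = s_cent / np.linalg.norm(s_cent, axis=1, keepdims=True)
    return (c_n @ s_n.T).argmax(axis=1)  # content region -> best style region

# Toy usage with random arrays standing in for latent diffusion features.
content = np.random.rand(32, 32, 64).astype(np.float32)
style = np.random.rand(32, 32, 64).astype(np.float32)
c_lab, s_lab = segment_regions(content), segment_regions(style)
print(match_regions(content, c_lab, style, s_lab))
```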
https://arxiv.org/abs/2603.10354
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs. global, semantic vs. photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters that cannot dynamically prioritize or suppress conflicting modalities, resulting in artifacts such as color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit), which aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information into the base image. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of the specialized experts, showcasing the importance of dynamic, condition-aware processing for mitigating multi-condition conflicts.
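A minimal sketch of the sparse top-K routing idea described above, with four toy linear experts standing in for the Text, Mask, Reference, and Base experts; the dimensions, K, and expert bodies are illustrative assumptions, not CARE-Edit's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKExpertRouter(nn.Module):
    def __init__(self, dim: int = 64, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)  # stand-in for the latent-attention router
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) encoded diffusion tokens
        logits = self.router(tokens)                    # (B, N, n_experts) routing scores
        top_val, top_idx = logits.topk(self.k, dim=-1)  # keep only the K highest-scoring experts
        weights = F.softmax(top_val, dim=-1)            # renormalize over the selected experts
        out = torch.zeros_like(tokens)
        # For clarity every expert is run densely here and masked afterwards;
        # a real implementation would dispatch only the routed tokens.
        for e, expert in enumerate(self.experts):
            expert_out = expert(tokens)
            for slot in range(self.k):
                mask = (top_idx[..., slot] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., slot:slot + 1] * expert_out
        return out

tokens = torch.randn(2, 16, 64)                 # toy batch of encoded diffusion tokens
print(TopKExpertRouter()(tokens).shape)         # torch.Size([2, 16, 64])
```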
https://arxiv.org/abs/2603.08589
Computer vision-based style transfer techniques have been used for many years to represent artistic style. However, most contemporary methods are restricted to the pixel domain; in other words, they modify image pixels to incorporate artistic style. Real artistic work, by contrast, is made of brush strokes of different colors on a canvas, and pixel-based approaches are an unnatural way to represent such images. This paper therefore discusses a style transfer method that represents the image in the brush-stroke domain instead of the RGB domain, yielding better visual quality than pixel-based methods.
https://arxiv.org/abs/2603.07776
Visual place recognition (VPR), which aims to predict the location of an image based solely on its visual features, is a fundamental task in robotics and autonomous systems. Domain variation remains one of the main challenges in VPR and is relatively unexplored. Existing VPR models attempt to achieve domain agnosticism either by training on large-scale datasets that inherently contain some domain variation, or by being specifically adapted to particular target domains. In practice, the former lacks explicit domain supervision, while the latter generalizes poorly to unseen domain shifts. This paper proposes a novel query-based domain-agnostic VPR model called QdaVPR. First, a dual-level adversarial learning framework is designed to encourage domain invariance both for the query features forming the global descriptor and for the image features from which these query features are derived. Then, a triplet supervision scheme based on query combinations is designed to enhance the discriminative power of the global descriptors. To support the learning process, we augment a large-scale VPR dataset using style transfer methods, generating various synthetic domains with corresponding domain labels as auxiliary supervision. Extensive experiments show that QdaVPR achieves state-of-the-art performance on multiple VPR benchmarks with significant domain variations. Specifically, it attains the best Recall@1 and Recall@10 in nearly all test scenarios: 93.5%/98.6% on Nordland (seasonal changes), 97.5%/99.0% on Tokyo24/7 (day-night transitions), and the highest Recall@1 across almost all weather conditions on the SVOX dataset. Our code will be released at this https URL.
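The abstract does not specify how the adversarial objective is implemented; one standard realization of domain-adversarial feature learning, assumed here purely for illustration, is a gradient reversal layer feeding a domain classifier trained on the synthetic domain labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # flip the gradient flowing into the feature extractor

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Toy setup: a feature extractor and a classifier over 4 synthetic style-transfer domains.
feat_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
domain_clf = nn.Linear(64, 4)
x, domain_labels = torch.randn(8, 128), torch.randint(0, 4, (8,))

feats = feat_net(x)                               # stand-in for query- or image-level features
domain_logits = domain_clf(grad_reverse(feats))
loss = F.cross_entropy(domain_logits, domain_labels)
loss.backward()  # the classifier learns domains; reversed gradients push feats toward invariance
```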
https://arxiv.org/abs/2603.07414
This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images that intentionally pastiche original artworks such as paintings, drawings, sculptures, and installations. The process involved twelve artists from Romania, Bulgaria, France, Austria, and the United Kingdom, each invited to contribute three of their artworks and to grade and comment on the AI-generated versions. The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions. The results point to a significant gap between color- and texture-based similarity and compositional, conceptual, and perceptual similarity. Consequently, we advocate for a "style transfer dashboard" of complementary metrics to evaluate the similarity between pastiches and originals, rather than a single style metric. The artists' comments revealed limitations of ChatGPT's pastiches after contemporary artworks, which the authors of the originals perceived as lacking dimensionality, context, and intentional sense, and as more of a paraphrase or approximate quotation than a valuable, emotion-evoking artwork.
https://arxiv.org/abs/2603.06324
The proliferation of facial recognition (FR) systems has raised privacy concerns in the digital realm, as malicious uses of FR models pose a significant threat. Traditional countermeasures, such as makeup style transfer, have suffered from low transferability in black-box settings and limited applicability across various demographic groups, including males and individuals with darker skin tones. To address these challenges, we introduce a novel facial privacy protection method, dubbed MAP, a pioneering approach that employs human emotion modifications to disguise original identities as target identities in facial images. Our method uniquely fine-tunes a score network to learn dual objectives, target identity and human expression, which are jointly optimized through gradient projection to ensure convergence at a shared local optimum. Additionally, we enhance the perceptual quality of protected images by applying local smoothness regularization and optimizing the score matching loss within our network. Empirical experiments demonstrate that our innovative approach surpasses previous baselines, including noise-based, makeup-based, and freeform attribute methods, in both qualitative fidelity and quantitative metrics. Furthermore, MAP proves its effectiveness against an online FR API and shows advanced adaptability in uncommon photographic scenarios.
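The joint optimization via gradient projection could resemble a PCGrad-style projection, which is assumed here purely for illustration (the paper's exact projection rule may differ): when the identity and expression gradients conflict, each is projected onto the normal plane of the other before the update.

```python
import torch

def project_conflicting(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Return g1 with any component that conflicts with g2 removed."""
    dot = torch.dot(g1, g2)
    if dot < 0:                                          # gradients point in opposing directions
        g1 = g1 - dot / (g2.norm() ** 2 + 1e-12) * g2    # drop the conflicting component
    return g1

# Toy flattened gradients of the two objectives for the same parameters.
g_identity = torch.tensor([1.0, -2.0, 0.5])
g_expression = torch.tensor([-0.5, 1.0, 1.0])
update = project_conflicting(g_identity, g_expression) + \
         project_conflicting(g_expression, g_identity)
print(update)  # combined direction with the mutually conflicting components removed
```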
https://arxiv.org/abs/2603.03665
Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect disease and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRRs), which use CT scans to generate synthetic frontal chest X-rays with artificially inserted lung nodules, offer one potential solution. However, this approach suffers from significant image quality degradation, particularly blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process: first, we investigate two independent approaches, DDPM-LQ and the GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity by framing it as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing the clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant features, validated by both quantitative metrics and expert radiological assessment.
https://arxiv.org/abs/2603.01686
Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behavior in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline, Gen4Seg, to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, and position, as well as image-level variations such as weather and style. To achieve this, we propose to edit the visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, existing segmentation labels can be reused for the edited images, which greatly reduces labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness than closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improves both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.
https://arxiv.org/abs/2603.01535
We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusing basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach improves perceptual image quality, primarily owing to its novel use of residual corrections, while retaining the same trilinear interpolation complexity with a significantly smaller number of network, residual-correction, and LUT parameters. Trained on the MIT-Adobe FiveK dataset, LoR-LUT reproduces expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image via a set of sliders that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
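A rough sketch of the low-rank LUT formulation, assuming CP-style separable residual factors, a handful of basis LUTs, and scipy trilinear interpolation; the sizes, rank, and residual parameterization are illustrative, not the paper's exact design:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

S, RANK, N_BASIS = 17, 4, 3                          # LUT size, residual rank, #basis LUTs
rng = np.random.default_rng(0)

basis = rng.random((N_BASIS, S, S, S, 3))            # dense basis LUTs
w = rng.random(N_BASIS); w /= w.sum()                # fusion weights (e.g., image-adaptive)
# Low-rank residual: sum of RANK separable (outer-product) corrections per output channel.
fr, fg, fb = rng.standard_normal((3, 3, RANK, S)) * 0.01
residual = np.einsum('crI,crJ,crK->IJKc', fr, fg, fb)   # (S, S, S, 3)

lut = np.einsum('n,nijkc->ijkc', w, basis) + residual   # fused basis LUTs + residual correction

grid = [np.linspace(0, 1, S)] * 3
def apply_lut(pixels: np.ndarray) -> np.ndarray:
    """Map (N, 3) RGB values in [0, 1] through the LUT via trilinear interpolation."""
    out = [RegularGridInterpolator(grid, lut[..., c])(pixels) for c in range(3)]
    return np.stack(out, axis=-1)

print(apply_lut(rng.random((5, 3))))                 # five example pixels
```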
https://arxiv.org/abs/2602.22607
Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which can be isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage, improves stylization quality, and strengthens prompt alignment across a wide range of style references and prompts.
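A hedged sketch of the CS-SVD step: decompose the style embedding via SVD and damp its tail singular components with a timestep-dependent exponential factor. The tail fraction, the schedule shape, and its direction over timesteps are assumptions for illustration:

```python
import torch

def suppress_tail(style_emb: torch.Tensor, t: int, T: int,
                  tail_frac: float = 0.5, strength: float = 4.0) -> torch.Tensor:
    """style_emb: (n_tokens, dim) style embedding; t: current denoising step out of T."""
    U, S, Vh = torch.linalg.svd(style_emb, full_matrices=False)
    k = int(len(S) * (1.0 - tail_frac))                 # head components left untouched
    # Exponential, timestep-dependent damping factor (direction of the schedule is assumed).
    damp = torch.exp(torch.tensor(-strength * (1.0 - t / T)))
    scale = torch.ones_like(S)
    scale[k:] = damp                                     # shrink only the tail singular values
    return (U * (S * scale)) @ Vh

style_emb = torch.randn(16, 64)                          # toy style-encoder output
print(suppress_tail(style_emb, t=900, T=1000).shape)     # mild damping at high-noise steps
print(suppress_tail(style_emb, t=50, T=1000).shape)      # strong damping near the end
```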
https://arxiv.org/abs/2602.20721
Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
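The two supervision terms could take roughly the following form, assuming flattened attention maps and binary object masks; the KL direction, normalization, and weighting are illustrative assumptions rather than the paper's exact losses:

```python
import torch
import torch.nn.functional as F

def focus_loss(attn: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """attn, mask: (B, H*W). KL divergence between the mask and attention distributions."""
    attn_dist = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    mask_dist = mask / (mask.sum(dim=-1, keepdim=True) + eps)
    kl = mask_dist * (torch.log(mask_dist + eps) - torch.log(attn_dist + eps))
    return kl.sum(dim=-1).mean()

def cover_loss(attn: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy that treats the (0, 1) attention map as a mask prediction."""
    return F.binary_cross_entropy(attn.clamp(1e-6, 1 - 1e-6), mask)

attn = torch.rand(2, 32 * 32)                    # toy style-token attention maps
mask = (torch.rand(2, 32 * 32) > 0.7).float()    # toy binary object masks
total = focus_loss(attn, mask) + cover_loss(attn, mask)
print(total.item())                              # combined attention-supervision objective
```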
https://arxiv.org/abs/2602.19254
Few-shot Chinese font generation aims to synthesize new characters in a target style using only a handful of reference images. Achieving accurate content rendering and faithful style transfer requires effective disentanglement between content and style. However, existing approaches achieve only feature-level disentanglement, allowing the generator to re-entangle these features, leading to content distortion and degraded style fidelity. We propose the Structure-Level Disentangled Diffusion Model (SLD-Font), which receives content and style information from two separate channels. SimSun-style images are used as content templates and concatenated with noisy latent features as the input. Style features extracted by a CLIP model from target-style images are integrated via cross-attention. Additionally, we train a Background Noise Removal module in the pixel space to remove background noise in complex stroke regions. Based on theoretical validation of disentanglement effectiveness, we introduce a parameter-efficient fine-tuning strategy that updates only the style-related modules. This allows the model to better adapt to new styles while avoiding overfitting to the reference images' content. We further introduce the Grey and OCR metrics to evaluate the content quality of generated characters. Experimental results show that SLD-Font achieves significantly higher style fidelity while maintaining comparable content accuracy to existing state-of-the-art methods.
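A toy sketch of the two-channel conditioning described above, assuming the content channel is a concatenation of the noisy latent with an encoded SimSun template and that CLIP-extracted style features enter through a standard cross-attention block; all dimensions and the attention-module choice are illustrative assumptions:

```python
import torch
import torch.nn as nn

class StyleCrossAttention(nn.Module):
    def __init__(self, latent_dim: int = 64, style_dim: int = 512, heads: int = 4):
        super().__init__()
        self.to_kv = nn.Linear(style_dim, latent_dim)          # project CLIP features to latent width
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, latent_tokens, style_feats):
        kv = self.to_kv(style_feats)                           # (B, n_style_tokens, latent_dim)
        out, _ = self.attn(latent_tokens, kv, kv)              # queries come from the latent
        return latent_tokens + out                             # residual injection of style

B, N = 2, 16 * 16
noisy_latent = torch.randn(B, N, 32)
content_template = torch.randn(B, N, 32)                       # SimSun-rendered character, encoded
content_channel = torch.cat([noisy_latent, content_template], dim=-1)  # concatenation -> 64 dims
style_feats = torch.randn(B, 4, 512)                           # e.g., 4 reference glyphs through CLIP
print(StyleCrossAttention()(content_channel, style_feats).shape)
```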
https://arxiv.org/abs/2602.18874
Stereo depth estimation is fundamental to underwater robotic perception, yet suffers from severe domain shifts caused by wavelength-dependent light attenuation, scattering, and refraction. Recent approaches leverage monocular foundation models with GRU-based iterative refinement for underwater adaptation; however, the sequential gating and local convolutional kernels in GRUs necessitate multiple iterations for long-range disparity propagation, limiting performance in large-disparity and textureless underwater regions. In this paper, we propose StereoAdapter-2, which replaces the conventional ConvGRU updater with a novel ConvSS2D operator based on selective state space models. The proposed operator employs a four-directional scanning strategy that naturally aligns with epipolar geometry while capturing vertical structural consistency, enabling efficient long-range spatial propagation within a single update step at linear computational complexity. Furthermore, we construct UW-StereoDepth-80K, a large-scale synthetic underwater stereo dataset featuring diverse baselines, attenuation coefficients, and scattering parameters through a two-stage generative pipeline combining semantic-aware style transfer and geometry-consistent novel view synthesis. Combined with dynamic LoRA adaptation inherited from StereoAdapter, our framework achieves state-of-the-art zero-shot performance on underwater benchmarks, with a 17% improvement on TartanAir-UW and a 7.2% improvement on SQUID, while real-world validation on the BlueROV2 platform demonstrates the robustness of our approach. Code: this https URL. Website: this https URL.
https://arxiv.org/abs/2602.16915
Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
https://arxiv.org/abs/2602.16080
This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs round-trip translation to synthesize such parallel datasets from monolingual corpora. This approach creates 'neutralized' text devoid of stylistic attributes, essentially creating a shared input style at training and inference time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and few-shot in-context learning (ICL), measured by BLEU scores and style accuracy across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.
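A small sketch of the round-trip data synthesis, with a hypothetical translate(text, src, tgt) helper standing in for whatever MT system is actually used; the pivot language and the helper's signature are assumptions for illustration:

```python
from typing import Callable, List, Tuple

def build_parallel_pairs(styled_corpus: List[str],
                         translate: Callable[[str, str, str], str],
                         pivot: str = "de") -> List[Tuple[str, str]]:
    pairs = []
    for styled in styled_corpus:
        # Round trip through the pivot language tends to strip stylistic attributes.
        neutral = translate(translate(styled, "en", pivot), pivot, "en")
        pairs.append((neutral, styled))          # input: neutralized text, target: original style
    return pairs

# Toy stand-in translator so the sketch runs; a real MT model would go here.
def fake_translate(text: str, src: str, tgt: str) -> str:
    return text.lower() if tgt == "en" else text

print(build_parallel_pairs(["Verily, the morrow bringeth rain."], fake_translate))
```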
https://arxiv.org/abs/2602.15013
Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most operate at the global level and overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. A cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object- and region-level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.
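A minimal sketch of the dense alignment idea: compare intermediate diffusion features of the content and style images per pixel and keep, for each content location, the most similar style location. The feature shapes and plain argmax matching are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dense_alignment(content_feats: torch.Tensor, style_feats: torch.Tensor) -> torch.Tensor:
    """content_feats, style_feats: (C, H, W) -> alignment map (H, W) of flat style indices."""
    c, h, w = content_feats.shape
    cf = F.normalize(content_feats.reshape(c, -1).T, dim=-1)   # (H*W, C), unit norm
    sf = F.normalize(style_feats.reshape(c, -1).T, dim=-1)
    sim = cf @ sf.T                                            # cosine similarity (H*W, H*W)
    return sim.argmax(dim=-1).reshape(h, w)                    # best style pixel per content pixel

content_feats = torch.randn(128, 24, 24)   # toy stand-ins for mid-layer diffusion features
style_feats = torch.randn(128, 24, 24)
print(dense_alignment(content_feats, style_feats).shape)       # torch.Size([24, 24])
```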
https://arxiv.org/abs/2602.14464
The ability of Flow Matching (FM) to model complex conditional distributions has established it as the state-of-the-art for prediction tasks (e.g., robotics, weather forecasting). However, deployment in safety-critical settings is hindered by a critical extrapolation hazard: driven by smoothness biases, flow models yield plausible outputs even for off-manifold conditions, resulting in silent failures indistinguishable from valid predictions. In this work, we introduce Diverging Flows, a novel approach that enables a single model to simultaneously perform conditional generation and native extrapolation detection by structurally enforcing inefficient transport for off-manifold inputs. We evaluate our method on synthetic manifolds, cross-domain style transfer, and weather temperature forecasting, demonstrating that it achieves effective detection of extrapolations without compromising predictive fidelity or inference latency. These results establish Diverging Flows as a robust solution for trustworthy flow models, paving the way for reliable deployment in domains such as medicine, robotics, and climate science.
https://arxiv.org/abs/2602.13061
Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.
https://arxiv.org/abs/2602.12563
Numerous models have shown great success in the fields of speech recognition and speech synthesis, but models for speech-to-speech processing have not been heavily explored. We propose the Speech to Speech Synthesis Network (STSSN), a model based on current state-of-the-art systems that fuses the two disciplines in order to perform effective speech-to-speech style transfer for the purpose of voice impersonation. We show that our proposed model is quite powerful and succeeds in generating realistic audio samples despite a number of limitations in its capacity. We benchmark our proposed model by comparing it with a generative adversarial model that accomplishes a similar task, and show that ours produces more convincing results.
https://arxiv.org/abs/2602.16721
Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model's prediction at t-1. To leverage videos' temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in: this https URL
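A toy sketch of the residual formulation, assuming the model conditions on its previous-frame prediction and regresses the residual to the current target frame; the stand-in network, noising step, and loss are illustrative assumptions, not RFDM's actual forward-process formulation:

```python
import torch
import torch.nn as nn

class ResidualPredictor(nn.Module):
    """Toy stand-in for the denoiser: sees the previous prediction and a noisy input."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prev_pred, noisy_target):
        return self.net(torch.cat([prev_pred, noisy_target], dim=1))

model = ResidualPredictor()
prev_pred = torch.rand(1, 3, 64, 64)          # model output for frame t-1
target_frame = torch.rand(1, 3, 64, 64)       # ground-truth edited frame t
residual_gt = target_frame - prev_pred        # what the model should learn to produce
noisy = residual_gt + 0.1 * torch.randn_like(residual_gt)   # stand-in for the diffused input

pred_residual = model(prev_pred, noisy)
loss = nn.functional.mse_loss(pred_residual, residual_gt)
frame_t = prev_pred + pred_residual           # reconstructed edited frame
print(loss.item(), frame_t.shape)
```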
https://arxiv.org/abs/2602.06871