In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework that not only addresses the various issues traditional methods encounter during style transfer but also unifies the framework across different tasks. The framework is designed to revolutionize the field by enabling artist-level style transfer and text-driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to strip style-related terms from these descriptions, we create a semantic gap, which is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of the self-attention layers in a manner similar to the cross-attention mechanism: during generation, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining the model's text controllability. We also introduce query preservation to mitigate disruption of the original content. Under this design, we achieve high-quality image-driven style transfer and text-driven stylization, delivering artist-level results while preserving the original image content. Moreover, we achieve image color editing during style transfer for the first time.
https://arxiv.org/abs/2506.15033
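As a rough illustration of StyleWallfacer's key/value swap, the sketch below replaces a self-attention layer's keys and values with those from a parallel style pass while keeping the content queries, optionally blended with saved queries for query preservation. Tensor names, shapes, and the blending weight are assumptions for illustration, not the authors' code.

```python
import torch

def styled_self_attention(q_content, k_style, v_style, q_saved=None, blend=0.5):
    """Hypothetical sketch: inside a self-attention layer of the content pass,
    attend with the content queries against keys/values taken from the style
    pass, optionally blending in saved queries (query preservation).

    q_content, k_style, v_style: (B, heads, N, d); q_saved: same shape or None.
    """
    scale = q_content.shape[-1] ** -0.5
    q = q_content if q_saved is None else blend * q_content + (1 - blend) * q_saved
    # Cross-attention-like behaviour: content queries, style keys/values.
    attn = torch.softmax((q @ k_style.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v_style
```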
Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose the Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. SA-LUT features: (1) a Style-guided 4D LUT Generator that extracts multi-scale features from the style image to predict a 4D LUT, and (2) a Context Generator that uses content-style cross-attention to produce a context map. This context map enables spatially adaptive adjustments, allowing the 4D LUT to apply precise color transformations while preserving structural integrity. To establish a rigorous evaluation framework for photorealistic style transfer, we introduce PST50, the first benchmark specifically designed for PST assessment. Experiments demonstrate that SA-LUT substantially outperforms state-of-the-art methods, achieving a 66.7% reduction in LPIPS score compared to 3D LUT approaches while maintaining real-time performance at 16 FPS for video stylization. Our code and benchmark are available at this https URL
https://arxiv.org/abs/2506.13465
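A minimal sketch of how a spatially adaptive 4D LUT could be applied at inference time, assuming the LUT is stored as K colour cubes indexed by a per-pixel context value; the shapes, axis convention, and linear interpolation along the context axis are assumptions, not SA-LUT's released implementation.

```python
import torch
import torch.nn.functional as F

def apply_4d_lut(image, context, lut):
    """Hypothetical sketch. image: (3, H, W) RGB in [0, 1]; context: (H, W) in
    [0, 1]; lut: (K, 3, B, B, B) with K >= 2 colour cubes indexed by the 4th
    (context) axis. Colour-cube axis order is assumed to match grid_sample's
    (x, y, z) -> (W, H, D) convention."""
    K = lut.shape[0]
    H, W = context.shape

    # Trilinear lookup of every pixel's colour inside each of the K cubes.
    rgb = image.permute(1, 2, 0).reshape(1, 1, 1, H * W, 3) * 2.0 - 1.0
    grid = rgb.expand(K, 1, 1, H * W, 3).contiguous()
    sampled = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)
    sampled = sampled.reshape(K, 3, H * W)

    # Linear interpolation along the context (4th) dimension.
    pos = context.clamp(0, 1).reshape(-1) * (K - 1)
    lo = pos.floor().long().clamp(max=K - 2)
    frac = (pos - lo.to(pos.dtype)).unsqueeze(-1)   # (H*W, 1)
    idx = torch.arange(H * W, device=image.device)
    out_lo = sampled[lo, :, idx]                    # (H*W, 3)
    out_hi = sampled[lo + 1, :, idx]
    out = out_lo + (out_hi - out_lo) * frac
    return out.t().reshape(3, H, W)
```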
We present a method for fine-grained control over music generation through inference-time interventions on MusicGen, an autoregressive generative music transformer. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using the weights of linear probes trained on it, or by steering the attention-layer activations in a similar manner. We observe that modelling this as a regression task improves performance, and we hypothesize that the mean-squared error better preserves meaningful directional information in the activation space. Combined with the global conditioning offered by text prompts in MusicGen, our method provides both global and local control over music generation. Audio samples illustrating our method are available at our demo page.
https://arxiv.org/abs/2506.10225
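A minimal sketch of the inference-time steering idea, assuming the probe weight vector is added to a layer's residual-stream output through a PyTorch forward hook; the layer path and scaling factor below are hypothetical, not the paper's configuration.

```python
import torch

def make_steering_hook(direction, alpha=4.0):
    """Nudge a layer's hidden states along a probe direction at inference time.

    `direction` is assumed to be the weight vector of a linear probe trained
    on the residual activations; `alpha` controls steering strength.
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Usage (assumed layer path, for illustration only):
# handle = model.decoder.layers[12].register_forward_hook(make_steering_hook(probe_weight))
# ... run generation ...
# handle.remove()
```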
As a foundational technology for intelligent human-computer interaction, voice conversion (VC) seeks to transform speech from any source timbre into any target timbre. Traditional voice conversion methods based on Generative Adversarial Networks (GANs) encounter significant challenges in precisely encoding diverse speech elements and effectively synthesizing these elements into natural-sounding converted speech. To overcome these limitations, we introduce Pureformer-VC, an encoder-decoder framework that utilizes Conformer blocks to build a disentangled encoder and employs Zipformer blocks to create a style transfer decoder. We adopt a variational decoupled training approach to isolate speech components using a Variational Autoencoder (VAE), complemented by triplet discriminative training to strengthen speaker discrimination. Furthermore, we incorporate the Attention Style Transfer Mechanism (ASTM) with Zipformer's shared weights to improve style transfer performance in the decoder. We conducted experiments on two multi-speaker datasets. The experimental results demonstrate that the proposed model achieves comparable subjective evaluation scores while significantly improving objective metrics compared to existing approaches in many-to-many and many-to-one VC scenarios.
https://arxiv.org/abs/2506.08348
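The triplet discriminative training mentioned above can be sketched with PyTorch's built-in triplet margin loss; the margin and batch layout are assumptions, not Pureformer-VC's settings.

```python
import torch.nn as nn

# Margin is an illustrative choice, not the paper's hyperparameter.
triplet = nn.TripletMarginLoss(margin=0.3)

def speaker_triplet_loss(anchor, positive, negative):
    """anchor, positive: (B, D) speaker embeddings of two utterances from the
    same speaker; negative: (B, D) embeddings from different speakers.
    Pulls same-speaker embeddings together and pushes others apart."""
    return triplet(anchor, positive, negative)
```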
Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model, guided by layer vitality. For object addition, we further identify prominent layers from which to extract the mask regions corresponding to the newly added target prompt. We find that the masks extracted from these prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: this https URL
https://arxiv.org/abs/2506.07205
In recent years, there has been a growing demand to stylize a given 3D scene to align with the artistic style of reference images for creative purposes. While 3D Gaussian Splatting (GS) has emerged as a promising and efficient method for realistic 3D scene modeling, it remains challenging to adapt it so that 3D GS scenes can be stylized to match multiple styles, through automatic local style transfer or manual designation, while maintaining memory efficiency during stylization training. In this paper, we introduce a novel 3D GS stylization solution termed Multi-StyleGS to tackle these challenges. In particular, we employ a bipartite matching mechanism to automatically identify correspondences between the style images and the local regions of the rendered images. To facilitate local style transfer, we introduce a novel semantic style loss function that employs a segmentation network to apply distinct styles to various objects of the scene, and we propose a local-global feature matching scheme to enhance multi-view consistency. Furthermore, this technique achieves memory-efficient training, richer texture details, and better color matching. To better assign a robust semantic label to each Gaussian, we propose several techniques to regularize the segmentation network. As demonstrated by our comprehensive experiments, our approach outperforms existing ones in producing plausible stylization results and offering flexible editing.
https://arxiv.org/abs/2506.06846
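A sketch of the bipartite-matching step, assuming style images and local regions of the rendering are each summarized by a feature vector and matched one-to-one with the Hungarian algorithm; the feature extractor and cosine-distance cost are assumptions, not Multi-StyleGS's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_styles_to_regions(region_feats, style_feats):
    """Assign each style image to the rendered-image region it resembles most.

    region_feats: (R, D) mean feature of each local region of the rendering
    style_feats:  (S, D) feature of each style image, with S <= R
    Returns a list of (style_index, region_index) pairs.
    """
    a = style_feats / np.linalg.norm(style_feats, axis=1, keepdims=True)
    b = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # cosine-distance cost matrix (S, R)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))
```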
While diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation is particularly acute for images where faces are small or exhibit significant camera-to-face distances, frequently leading to inadequate identity preservation. To address this, we introduce a novel, training-free framework for identity-preserved stylized image synthesis using diffusion models. Key contributions include: (1) the "Mosaic Restored Content Image" technique, significantly enhancing identity retention, especially in complex scenes; and (2) a training-free content consistency loss that enhances the preservation of fine-grained content details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially surpasses the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, particularly under conditions of small facial regions or significant camera-to-face distances, all without necessitating model retraining or fine-tuning.
https://arxiv.org/abs/2506.06802
Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
https://arxiv.org/abs/2506.04013
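The mix-style layer normalization could plausibly resemble MixStyle-like statistic mixing; the sketch below normalizes content features with their own statistics and re-styles them with statistics mixed across random pairs in the batch. The Beta parameter and feature layout are assumptions, not the paper's exact formulation.

```python
import torch

def mix_style(x, alpha=0.1, eps=1e-6):
    """x: (B, T, D) content features. Normalize each utterance with its own
    statistics, then apply statistics mixed between random batch pairs,
    discouraging the encoder from retaining source-specific style."""
    mu = x.mean(dim=1, keepdim=True)
    sigma = x.std(dim=1, keepdim=True) + eps
    x_norm = (x - mu) / sigma

    perm = torch.randperm(x.size(0))
    lam = torch.distributions.Beta(alpha, alpha).sample((x.size(0), 1, 1)).to(x.device)
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1 - lam) * sigma[perm]
    return x_norm * sigma_mix + mu_mix
```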
Large Language Models (LLMs) and Multimodal LLMs have shown promising capabilities for SVG processing, yet existing benchmarks suffer from limited real-world coverage, lack of complexity stratification, and fragmented evaluation paradigms. We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics. We assess 22 mainstream models spanning different scales, architectures, training paradigms, and accessibility levels. Our analysis reveals that while proprietary models significantly outperform open-source counterparts, all models exhibit systematic performance degradation with increasing complexity, indicating fundamental limitations in current approaches; however, reasoning-enhanced training proves more effective than pure scaling for overcoming these limitations, though style transfer remains the most challenging capability across all model types. SVGenius establishes the first systematic evaluation framework for SVG processing, providing crucial insights for developing more capable vector graphics models and advancing automated graphic design applications. Appendix and supplementary materials (including all data and code) are available at this https URL.
https://arxiv.org/abs/2506.03139
Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, yet fine-grained timbre information may leak through prosody, and transferring target prosody to the synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into HuBERT content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the linguistic content duration to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) with shortcut flow matching during training, conditioning the network not only on the current noise level but also on the desired step size, enabling high timbre similarity and high-quality speech generation in fewer sampling steps, even just two, thus minimizing latency. Experimental results show that R-VC achieves speaker similarity comparable to state-of-the-art VC methods with a smaller dataset, and surpasses them in speech naturalness, intelligibility, and style transfer performance.
https://arxiv.org/abs/2506.01014
In medical image segmentation, limited external validity remains a critical obstacle when models are deployed across unseen datasets, an issue particularly pronounced in the ultrasound image domain. Existing solutions, such as domain adaptation and GAN-based style transfer, while promising, often fall short in the medical domain, where datasets are typically small and diverse. This paper presents a novel application of principal component analysis (PCA) to address this limitation. PCA preprocessing reduces noise and emphasizes essential features by retaining approximately 90% of the dataset variance. We evaluate our approach across six diverse breast tumor ultrasound datasets comprising 3,983 B-mode images and corresponding expert tumor segmentation masks. For each dataset, a corresponding dimensionality-reduced PCA dataset is created, and U-Net-based segmentation models are trained on each of the twelve datasets. Each model trained on an original dataset was evaluated on the remaining five out-of-domain original datasets (baseline results), while each model trained on a PCA dataset was evaluated on the five out-of-domain PCA datasets. Our experimental results indicate that using PCA-reconstructed datasets, instead of original images, improves the model's recall and Dice scores, particularly for model-dataset pairs where baseline performance was lowest, achieving statistically significant gains in recall (0.57 ± 0.07 vs. 0.70 ± 0.05, p = 0.0004) and Dice scores (0.50 ± 0.06 vs. 0.58 ± 0.06, p = 0.03). Our method reduced the decline in recall values due to external validation by 33%. These findings underscore the potential of PCA reconstruction as a safeguard to mitigate declines in segmentation performance, especially in challenging cases, with implications for enhancing external validity in real-world medical applications.
https://arxiv.org/abs/2505.23587
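The PCA preprocessing step can be sketched directly with scikit-learn, keeping enough components to retain roughly 90% of the dataset variance and reconstructing the images from them; array shapes are illustrative, not the paper's pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reconstruct(images, variance=0.90):
    """images: (N, H, W) grayscale ultrasound images.
    Fit PCA on the flattened images, keep components covering ~90% of the
    variance, and return the reconstructed (denoised) images."""
    n, h, w = images.shape
    flat = images.reshape(n, -1).astype(np.float32)
    pca = PCA(n_components=variance, svd_solver="full")
    reduced = pca.fit_transform(flat)
    reconstructed = pca.inverse_transform(reduced)
    return reconstructed.reshape(n, h, w)
```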
Deep learning models often struggle to maintain performance when deployed on data distributions different from their training data, particularly in real-world applications where environmental conditions frequently change. While Multi-source Domain Generalization (MDG) has shown promise in addressing this challenge by leveraging multiple source domains during training, its practical application is limited by the significant costs and difficulties associated with creating multi-domain datasets. To address this limitation, we propose Pseudo Multi-source Domain Generalization (PMDG), a novel framework that enables the application of sophisticated MDG algorithms in more practical Single-source Domain Generalization (SDG) settings. PMDG generates multiple pseudo-domains from a single source domain through style transfer and data augmentation techniques, creating a synthetic multi-domain dataset that can be used with existing MDG algorithms. Through extensive experiments with PseudoDomainBed, our modified version of the DomainBed benchmark, we analyze the effectiveness of PMDG across multiple datasets and architectures. Our analysis reveals several key findings, including a positive correlation between MDG and PMDG performance and the potential of pseudo-domains to match or exceed actual multi-domain performance with sufficient data. These comprehensive empirical results provide valuable insights for future research in domain generalization. Our code is available at this https URL.
https://arxiv.org/abs/2505.23173
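A minimal sketch of turning one source domain into several pseudo-domains with simple style-like augmentations; the paper also uses learned style transfer, and the specific transforms below are illustrative rather than PseudoDomainBed's configuration.

```python
import torchvision.transforms as T

# Each augmentation pipeline is treated as a distinct pseudo-domain so that
# any multi-source DG algorithm can be trained on a single-source dataset.
pseudo_domains = [
    T.Compose([T.ColorJitter(brightness=0.6, contrast=0.6, saturation=0.6)]),
    T.Compose([T.Grayscale(num_output_channels=3)]),
    T.Compose([T.GaussianBlur(kernel_size=7, sigma=(1.0, 3.0))]),
    T.Compose([T.RandomPosterize(bits=3, p=1.0)]),
]

def to_pseudo_domain(image, domain_id):
    """Map a source-domain PIL image into pseudo-domain `domain_id`."""
    return pseudo_domains[domain_id](image)
```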
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
https://arxiv.org/abs/2505.23161
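The Orthogonal Procrustes projection mentioned above has a closed-form solution via SVD; the sketch below aligns paired text and image embeddings with the optimal rotation. Shapes and usage are assumptions, not the authors' code.

```python
import torch

def procrustes_align(text_emb, image_emb):
    """text_emb, image_emb: (N, D) paired CLIP embeddings.
    Find the orthogonal matrix R minimizing ||text_emb @ R - image_emb||_F
    and return the rotated text embeddings."""
    # Closed-form solution via SVD of the cross-covariance matrix.
    u, _, vh = torch.linalg.svd(text_emb.T @ image_emb)
    r = u @ vh
    return text_emb @ r
```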
Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussians, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. CLIPGaussians enables joint optimization of color and geometry in 3D and 4D settings and achieves temporal coherence in videos, all while preserving model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussians as a universal and efficient solution for multimodal style transfer.
https://arxiv.org/abs/2505.22854
Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that Spotlight-TTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability. Our audio samples are publicly available.
https://arxiv.org/abs/2505.20868
This work introduces Ui2i, a novel model for unpaired image-to-image translation, trained on content-wise unpaired datasets to enable style transfer across domains while preserving content. Building on CycleGAN, Ui2i incorporates key modifications to better disentangle content and style features, and preserve content integrity. Specifically, Ui2i employs U-Net-based generators with skip connections to propagate localized shallow features deep into the generator. Ui2i removes feature-based normalization layers from all modules and replaces them with approximate bidirectional spectral normalization -- a parameter-based alternative that enhances training stability. To further support content preservation, channel and spatial attention mechanisms are integrated into the generators. Training is facilitated through image scale augmentation. Evaluation on two biomedical tasks -- domain adaptation for nuclear segmentation in immunohistochemistry (IHC) images and unmixing of biological structures superimposed in single-channel immunofluorescence (IF) images -- demonstrates Ui2i's ability to preserve content fidelity in settings that demand more accurate structural preservation than typical translation tasks. To the best of our knowledge, Ui2i is the first approach capable of separating superimposed signals in IF images using real, unpaired training data.
https://arxiv.org/abs/2505.20746
Given a pair of source and reference speech recordings, audio-to-audio (A2A) style transfer involves generating output speech that mimics the style characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a novel framework, termed A2A Zero-shot Emotion Style Transfer (A2A-ZEST), that enables the transfer of reference emotional attributes to the source while retaining its speaker and speech content. The A2A-ZEST framework consists of an analysis-synthesis pipeline, where the analysis module decomposes speech into semantic tokens, speaker representations, and emotion embeddings. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech based on the input representations and the derived factors. This entire analysis-synthesis paradigm is trained purely in a self-supervised manner with an auto-encoding loss. For A2A emotion style transfer, the emotion embedding extracted from the reference speech, along with the remaining representations from the source speech, is used in the synthesis module to generate the style-translated speech. In our experiments, we evaluate the converted speech on content/speaker preservation (w.r.t. the source) as well as on the effectiveness of the emotion style transfer (w.r.t. the reference). A2A-ZEST is shown to improve over prior works on these evaluations, thereby enabling style transfer without any parallel training data. We also illustrate the application of the proposed work for data augmentation in emotion recognition tasks.
https://arxiv.org/abs/2505.17655
We introduce Color Disentangled Style Transfer (CDST), a novel and efficient two-stream style transfer training paradigm which completely isolates color from style and forces the style stream to be color-blind. With a single model, CDST unlocks universal style transfer capabilities in a tuning-free manner during inference. In particular, characteristics-preserving style transfer with both style and content references is solved in a tuning-free way for the first time. CDST significantly improves style similarity through multi-feature image embedding compression and preserves strong editing capability via our new CDST style definition, inspired by the Diffusion UNet disentanglement law. Through thorough qualitative and quantitative experiments and human evaluations, we demonstrate that CDST achieves state-of-the-art results on various style transfer tasks.
https://arxiv.org/abs/2506.13770
During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
https://arxiv.org/abs/2505.16900
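A hedged sketch of a power-law-decay-weighted cross-entropy, down-weighting frequent tokens by (frequency + eps)^(-alpha); the exponent, smoothing constant, and normalization are assumptions rather than the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def power_law_decay_loss(logits, targets, token_freq, alpha=0.5, eps=1.0,
                         ignore_index=-100):
    """logits: (B, T, V); targets: (B, T) gold token ids;
    token_freq: (V,) corpus frequency counts per vocabulary item.
    Rare, information-dense tokens receive larger weights."""
    weights = (token_freq.float() + eps).pow(-alpha)
    weights = weights / weights.mean()            # keep the loss scale comparable
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         ignore_index=ignore_index, reduction="none")  # (B, T)
    mask = (targets != ignore_index).float()
    tok_w = weights[targets.clamp(min=0)] * mask
    return (ce * tok_w).sum() / tok_w.sum().clamp(min=1.0)
```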
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as the source domain and Cityscapes/ACDC as target domains show that our approach produces higher-quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.
https://arxiv.org/abs/2505.16360
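A minimal sketch of the class-wise adaptive instance normalization idea: for each semantic class, content features are shifted to the style features' per-class statistics. Feature and label-map shapes are illustrative, and the cross-attention filtering of CACTIF is not shown.

```python
import torch

def class_wise_adain(content_feat, style_feat, content_seg, style_seg, eps=1e-5):
    """content_feat, style_feat: (C, H, W) feature maps;
    content_seg, style_seg: (H, W) integer semantic label maps.
    Normalize content features per class, then apply the style features'
    per-class mean and standard deviation."""
    out = content_feat.clone()
    for cls in torch.unique(content_seg):
        c_mask = content_seg == cls
        s_mask = style_seg == cls
        if s_mask.sum() == 0:          # class absent from the style image
            continue
        c = content_feat[:, c_mask]    # (C, Nc)
        s = style_feat[:, s_mask]      # (C, Ns)
        c_mu, c_std = c.mean(1, keepdim=True), c.std(1, keepdim=True) + eps
        s_mu, s_std = s.mean(1, keepdim=True), s.std(1, keepdim=True) + eps
        out[:, c_mask] = (c - c_mu) / c_std * s_std + s_mu
    return out
```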