Abstract
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet recent foundation models make it possible to generate realistic images without any training. This paper proposes leveraging such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as the source domain and Cityscapes/ACDC as target domains show that our approach produces higher-quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target-domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.
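To make the class-wise normalization idea concrete, the following is a minimal NumPy sketch of adaptive instance normalization applied per semantic class. It is an illustration under stated assumptions, not the paper's implementation: the function name `classwise_adain`, the `(C, H, W)` feature layout, the `eps` value, and the choice to leave classes absent from the style image unchanged are all hypothetical, and the cross-attention and filtering components of CACTI/CACTIF are not modeled here.

```python
import numpy as np


def classwise_adain(content, style, content_labels, style_labels, eps=1e-5):
    """Match per-class channel statistics of `content` to those of `style`.

    content, style: float arrays of shape (C, H, W) (assumed feature layout).
    content_labels, style_labels: integer class maps of shape (H, W).
    Classes missing from the style labels are left untouched (an assumption).
    """
    out = np.array(content, dtype=np.float64, copy=True)
    style = np.asarray(style, dtype=np.float64)

    # Only classes present in both images have style statistics to transfer.
    shared = np.intersect1d(content_labels, style_labels)
    for cls in shared:
        c_mask = content_labels == cls
        s_mask = style_labels == cls

        c_feat = out[:, c_mask]      # (C, number of content pixels in class)
        s_feat = style[:, s_mask]    # (C, number of style pixels in class)

        c_mean = c_feat.mean(axis=1, keepdims=True)
        c_std = c_feat.std(axis=1, keepdims=True)
        s_mean = s_feat.mean(axis=1, keepdims=True)
        s_std = s_feat.std(axis=1, keepdims=True)

        # Standard AdaIN formula, restricted to the pixels of one class.
        out[:, c_mask] = (c_feat - c_mean) / (c_std + eps) * s_std + s_mean
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(0.0, 1.0, size=(4, 8, 8))
    style = rng.normal(3.0, 2.0, size=(4, 8, 8))
    content_labels = np.zeros((8, 8), dtype=int)
    content_labels[:, 4:] = 1          # right half is class 1
    style_labels = np.zeros((8, 8), dtype=int)
    style_labels[4:, :] = 1            # bottom half is class 1
    stylized = classwise_adain(content, style, content_labels, style_labels)
    print(stylized.shape)
```

Restricting the statistics to one class at a time means that, for example, road pixels only inherit the statistics of road pixels in the style image, which is one plausible reading of how a class-aware transfer avoids the cross-class bleeding that a single global normalization can cause.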
Abstract (translated)
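Semantic segmentation models trained on synthetic data usually perform poorly on real-world images, particularly in adverse conditions where labeled data is scarce. However, recent foundation models can generate realistic images without any training. This paper proposes using these diffusion models to improve the performance of vision models learned only from synthetic data. We introduce two new techniques for semantically consistent style transfer: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI), and its extension with selective attention filtering (CACTIF). CACTI applies statistical normalization according to semantic class, while CACTIF further filters the cross-attention maps based on feature similarity, thereby avoiding artifacts in regions with weak correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural consistency, unlike methods that apply global transformations or generate content without constraints. Experiments using GTA5 as the source domain and Cityscapes/ACDC as the target domains show that the proposed methods produce higher-quality images with lower FID scores and better content preservation. Our study demonstrates that class-aware diffusion-based style transfer can effectively narrow the gap between synthetic data and the real world, and that it advances robust perception systems for challenging real-world applications even when target-domain data is minimal. The source code is available at: this https URL.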
URL
https://arxiv.org/abs/2505.16360