Paper Reading AI Learner

Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

2025-05-22 08:11:10
Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin

Abstract

Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as the source domain and Cityscapes/ACDC as target domains show that our approach produces higher-quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.
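The core idea behind class-wise AdaIN is easy to sketch: instead of matching feature statistics globally, the mean and standard deviation are matched per semantic class between corresponding regions of the content and style images. The paper's actual implementation operates inside a diffusion model's feature space and is not reproduced in this listing; the snippet below is only a minimal, hypothetical sketch of the per-class normalization step on plain feature maps (the function name `classwise_adain` and its signature are assumptions for illustration).

```python
import torch

def classwise_adain(content_feat, style_feat, content_mask, style_mask, eps=1e-5):
    """Sketch of class-wise Adaptive Instance Normalization.

    For each semantic class present in both images, the content features in
    that class's region are re-normalized to match the mean/std of the style
    features in the corresponding style region.

    content_feat, style_feat: (C, H, W) feature maps
    content_mask, style_mask: (H, W) integer class-label maps
    """
    out = content_feat.clone()
    for cls in torch.unique(content_mask):
        c_region = content_mask == cls          # (H, W) bool mask
        s_region = style_mask == cls
        c_vals = content_feat[:, c_region]      # (C, N_content) per-class pixels
        s_vals = style_feat[:, s_region]        # (C, N_style)
        if c_vals.shape[1] < 2 or s_vals.shape[1] < 2:
            continue  # class missing or too small in one image: leave untouched
        c_mean = c_vals.mean(1, keepdim=True)
        c_std = c_vals.std(1, keepdim=True) + eps
        s_mean = s_vals.mean(1, keepdim=True)
        s_std = s_vals.std(1, keepdim=True) + eps
        # shift/scale content statistics toward the style statistics
        out[:, c_region] = (c_vals - c_mean) / c_std * s_std + s_mean
    return out
```

Selectivity is what distinguishes this from vanilla AdaIN: a class absent from the style image (e.g. a traffic sign that only appears in the synthetic scene) is simply left unchanged rather than being pulled toward unrelated global statistics.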

URL

https://arxiv.org/abs/2505.16360

PDF

https://arxiv.org/pdf/2505.16360.pdf

