Paper Reading AI Learner

Source-Free Domain Adaptation for RGB-D Semantic Segmentation with Vision Transformers

2023-05-23 17:20:47
Giulia Rizzoli, Donald Shenaj, Pietro Zanuttigh

Abstract

With the increasing availability of depth sensors, multimodal frameworks that combine color information with depth data are attracting growing interest. In the challenging task of semantic segmentation, depth maps make it possible to distinguish similarly colored objects lying at different depths and provide useful geometric cues. On the other hand, ground-truth data for semantic segmentation is burdensome to produce, which makes domain adaptation another significant research area. Specifically, we address the challenging source-free domain adaptation setting, where the adaptation is performed without reusing source data. We propose MISFIT: MultImodal Source-Free Information fusion Transformer, a depth-aware framework that injects depth information into a vision-transformer-based segmentation module at multiple stages, namely at the input, feature, and output levels. Color and depth style transfer aids early-stage domain alignment, while re-wiring the self-attention between modalities creates mixed features that allow the extraction of better semantic content. Furthermore, a depth-based entropy minimization strategy is proposed to adaptively weight regions at different distances. Our framework, which is also the first approach using vision transformers for source-free semantic segmentation, shows noticeable performance improvements over standard strategies.
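
As a rough illustration of the depth-based entropy minimization idea mentioned in the abstract, the sketch below weights each pixel's prediction entropy by its normalized depth, so that regions at different distances contribute differently to the adaptation loss. The weighting scheme, tensor shapes, and the function name depth_weighted_entropy_loss are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def depth_weighted_entropy_loss(logits, depth, eps=1e-8):
    """Depth-weighted entropy minimization (illustrative sketch only).

    logits: (B, C, H, W) segmentation logits predicted on target-domain images.
    depth:  (B, 1, H, W) depth map aligned with the RGB input.

    The exact weighting used in the paper is not reproduced here; as an
    assumption, each pixel's entropy is scaled by its per-image normalized
    depth so that regions at different distances are weighted differently.
    """
    probs = F.softmax(logits, dim=1)
    # Per-pixel Shannon entropy of the predicted class distribution.
    entropy = -(probs * torch.log(probs + eps)).sum(dim=1, keepdim=True)  # (B, 1, H, W)
    # Normalize depth to [0, 1] within each image (assumed weighting scheme).
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    weight = (depth - d_min) / (d_max - d_min + eps)
    # Average the weighted entropy over all pixels and images.
    return (weight * entropy).mean()
```

In such a setup, a term like depth_weighted_entropy_loss(model(rgb), depth) would be added to the target-domain training objective during source-free adaptation; the actual loss formulation used by MISFIT may differ.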

Abstract (translated)

With the growing availability of depth sensors, multimodal frameworks that combine color information with depth data are receiving more and more attention. In the challenging task of semantic segmentation, depth maps make it possible to distinguish similarly colored objects at different depths and provide useful geometric cues. On the other hand, ground-truth data for semantic segmentation is laborious to provide, so domain adaptation is another important research area. Specifically, we propose MISFIT: MultImodal Source-Free Information fusion Transformer, a depth-aware framework that injects depth information into a vision-transformer-based segmentation module at multiple stages, namely at the input, feature, and output levels. Color and depth style transfer helps early-stage domain alignment, while re-wiring the self-attention between modalities creates mixed features that enable better extraction of semantic content. In addition, a depth-based entropy minimization strategy is proposed to adaptively weight regions at different distances. Our framework is also the first approach to use vision transformers for source-free semantic segmentation and shows noticeable performance improvements compared with standard strategies.

URL

https://arxiv.org/abs/2305.14269

PDF

https://arxiv.org/pdf/2305.14269.pdf

