Paper Reading AI Learner

Source-Free Domain Adaptation for RGB-D Semantic Segmentation with Vision Transformers

2023-05-23 17:20:47
Giulia Rizzoli, Donald Shenaj, Pietro Zanuttigh

Abstract

With the increasing availability of depth sensors, multimodal frameworks that combine color information with depth data are attracting growing interest. In the challenging task of semantic segmentation, depth maps make it possible to distinguish similarly colored objects lying at different depths and provide useful geometric cues. On the other hand, ground-truth data for semantic segmentation is burdensome to produce, which makes domain adaptation another significant research area. Specifically, we address the challenging source-free domain adaptation setting, where the adaptation is performed without reusing source data. We propose MISFIT: MultImodal Source-Free Information fusion Transformer, a depth-aware framework that injects depth information into a vision-transformer-based segmentation module at multiple stages, namely at the input, feature, and output levels. Color and depth style transfer aids early-stage domain alignment, while re-wiring the self-attention between modalities creates mixed features that allow the extraction of better semantic content. Furthermore, a depth-based entropy minimization strategy is proposed to adaptively weight regions at different distances. Our framework, which is also the first approach using vision transformers for source-free semantic segmentation, shows noticeable performance improvements over standard strategies.
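
As a rough illustration of the depth-based entropy minimization idea mentioned in the abstract, the sketch below weights each pixel's prediction entropy by its normalized depth, so that regions at different distances contribute differently to the adaptation loss. The weighting scheme, tensor shapes, and the function name depth_weighted_entropy_loss are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def depth_weighted_entropy_loss(logits, depth, eps=1e-8):
    """Depth-weighted entropy minimization (illustrative sketch only).

    logits: (B, C, H, W) segmentation logits predicted on target-domain images.
    depth:  (B, 1, H, W) depth map aligned with the RGB input.

    The exact weighting used in the paper is not reproduced here; as an
    assumption, each pixel's entropy is scaled by its per-image normalized
    depth so that regions at different distances are weighted differently.
    """
    probs = F.softmax(logits, dim=1)
    # Per-pixel Shannon entropy of the predicted class distribution.
    entropy = -(probs * torch.log(probs + eps)).sum(dim=1, keepdim=True)  # (B, 1, H, W)
    # Normalize depth to [0, 1] within each image (assumed weighting scheme).
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    weight = (depth - d_min) / (d_max - d_min + eps)
    # Average the weighted entropy over all pixels and images.
    return (weight * entropy).mean()
```

In such a setup, a term like depth_weighted_entropy_loss(model(rgb), depth) would be added to the target-domain training objective during source-free adaptation; the actual loss formulation used by MISFIT may differ.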

Abstract (translated)

With the growing availability of depth sensors, multimodal frameworks that combine color information with depth data are receiving more and more attention. In the challenging task of semantic segmentation, depth maps make it possible to distinguish similarly colored objects at different depths and provide useful geometric cues. On the other hand, ground-truth data for semantic segmentation is laborious to provide, so domain adaptation is another important research area. Specifically, we propose MISFIT: MultImodal Source-Free Information fusion Transformer, a depth-aware framework that injects depth information into a vision-transformer-based segmentation module at multiple stages, namely at the input, feature, and output levels. Color and depth style transfer helps early-stage domain alignment, while re-wiring the self-attention between modalities creates mixed features that enable better extraction of semantic content. In addition, a depth-based entropy minimization strategy is proposed to adaptively weight regions at different distances. Our framework is also the first approach to use vision transformers for source-free semantic segmentation and shows noticeable performance improvements compared with standard strategies.

URL

https://arxiv.org/abs/2305.14269

PDF

https://arxiv.org/pdf/2305.14269.pdf

