Abstract
Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently and thus fail to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring only minimal changes to existing architectures. STA applies broadly across diverse transformer architectures and remains effective for both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements over single-frame baselines: 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union. These results establish STA as an effective architectural enhancement for video-based semantic segmentation applications.
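To make the core idea concrete, the following is a minimal numpy sketch of single-head self-attention applied jointly over tokens from several frames, i.e., one hypothetical way an attention block can "process spatio-temporal feature sequences" as the abstract describes. The function name, shapes, and the simple flatten-frames-into-tokens strategy are illustrative assumptions, not the paper's actual STA implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(feats, Wq, Wk, Wv):
    """Single-head self-attention over tokens from all frames jointly.

    feats: (T, N, C) array — T frames, N spatial tokens per frame, C channels.
    Flattening the frame axis into the token axis lets every token attend to
    every token in every frame (an illustrative simplification, not the
    paper's exact STA design).
    """
    T, N, C = feats.shape
    x = feats.reshape(T * N, C)                    # joint spatio-temporal sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv               # per-token projections
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) # scaled dot-product weights
    out = attn @ v                                 # attention-weighted values
    return out.reshape(T, N, -1)                   # back to per-frame layout

# toy example: 3 frames, 4 spatial tokens, 8 channels
rng = np.random.default_rng(0)
f = rng.standard_normal((3, 4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
y = spatio_temporal_attention(f, Wq, Wk, Wv)
print(y.shape)  # (3, 4, 8)
```

Because the multi-frame context enters only through the token sequence fed to attention, a single-frame model (T = 1) is recovered unchanged, which is consistent with the abstract's claim of minimal architectural modification.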
URL
https://arxiv.org/abs/2602.10052