Paper Reading AI Learner

Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

2026-02-10 18:18:37
Serin Varghese, Kevin Ross, Fabian Hueger, Kira Maag

Abstract

Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA applies broadly across diverse transformer architectures and remains effective for both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results establish STA as an effective architectural enhancement for video-based semantic segmentation applications.
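The listing does not include code, but the core idea — letting tokens from several frames attend to each other jointly instead of frame by frame — can be sketched. The following is a minimal, hypothetical illustration in NumPy (the function name, projection matrices, and shapes are assumptions for the sketch, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(frames, wq, wk, wv):
    """Single-head self-attention over tokens pooled from T frames.

    frames: (T, N, D) — T frames, N spatial tokens per frame, D channels.
    A single-frame baseline would attend only within each frame's N tokens;
    here all T*N tokens attend to each other, so each output token can draw
    on features from neighbouring frames (the multi-frame context).
    """
    t, n, d = frames.shape
    tokens = frames.reshape(t * n, d)      # fold time into the token axis
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d))   # (T*N, T*N) spatio-temporal weights
    out = attn @ v
    return out.reshape(t, n, d)            # per-frame outputs, temporally informed

# toy example: 3 frames, 4 spatial tokens, 8 channels
rng = np.random.default_rng(0)
T, N, D = 3, 4, 8
x = rng.standard_normal((T, N, D))
wq, wk, wv = (rng.standard_normal((D, D)) for _ in range(3))
y = spatio_temporal_attention(x, wq, wk, wv)
print(y.shape)  # (3, 4, 8)
```

Note that because the input and output shapes match a per-frame attention block, such a module can replace an existing self-attention layer with minimal architectural change — consistent with the abstract's claim; the quadratic cost in T*N is why the paper emphasizes maintaining computational efficiency.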

URL

https://arxiv.org/abs/2602.10052

PDF

https://arxiv.org/pdf/2602.10052.pdf
