
MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

2023-11-30 13:20:09
Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

Abstract

Video panoptic segmentation requires consistently segmenting (for both 'thing' and 'stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that exploits Mask XFormer with Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer by leveraging trajectory attention. The deployed mask transformer takes as input a short clip consisting of only a few frames and predicts the clip-level segmentation. To enhance temporal consistency, MaXTron employs within-clip and cross-clip tracking modules that make efficient use of trajectory attention. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, it is nontrivial to directly extend trajectory attention to per-pixel dense prediction tasks due to its quadratic dependency on input size. To alleviate the issue, we propose to adapt trajectory attention for both the dense pixel features and the object queries, aiming to improve the short-term and long-term tracking results, respectively. In particular, in our within-clip tracking module, we propose axial-trajectory attention, which effectively computes the trajectory attention for tracking dense pixels sequentially along the height- and width-axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, since the object queries in the mask transformer are learned to encode the object information, we are able to capture the long-term temporal connections by applying trajectory attention to object queries, which learns to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performance on video segmentation benchmarks.
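The abstract does not give implementation details, so the following is only a minimal NumPy sketch of the idea of applying trajectory attention along one spatial axis at a time. It assumes the general trajectory attention recipe from video classification (spatial attention computed per frame to form trajectory tokens, then temporal attention along each trajectory), a single head, and identity query/key/value projections. The function names `trajectory_attention_1d`, `axial_trajectory_attention`, and `attention_pairs` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def trajectory_attention_1d(x):
    """Trajectory attention over tokens of shape (T, N, D).

    T frames, N positions along a single spatial axis, D channels.
    Assumption: learned projections are replaced by the identity.
    """
    T, N, D = x.shape
    scale = D ** -0.5
    q = k = v = x
    # Step 1: each query (t, n) attends over the N positions of every
    # frame s separately, giving one trajectory token per frame.
    logits = np.einsum("tnd,smd->tnsm", q, k) * scale
    attn = softmax(logits, axis=-1)                # normalise over positions m
    traj = np.einsum("tnsm,smd->tnsd", attn, v)    # (T, N, T, D)
    # Step 2: temporal attention along the trajectory. The query is the
    # trajectory token taken at the query's own frame.
    idx = np.arange(T)
    q2 = traj[idx, :, idx, :]                      # (T, N, D)
    logits2 = np.einsum("tnd,tnsd->tns", q2, traj) * scale
    attn2 = softmax(logits2, axis=-1)              # normalise over frames s
    return np.einsum("tns,tnsd->tnd", attn2, traj)


def axial_trajectory_attention(feat):
    """Apply 1D trajectory attention along height, then along width.

    feat has shape (T, H, W, D).
    """
    T, H, W, D = feat.shape
    # Height axis: every column w is an independent (T, H, D) sequence.
    out = np.stack(
        [trajectory_attention_1d(feat[:, :, w, :]) for w in range(W)], axis=2
    )
    # Width axis: every row h is an independent (T, W, D) sequence.
    out = np.stack(
        [trajectory_attention_1d(out[:, h, :, :]) for h in range(H)], axis=1
    )
    return out


def attention_pairs(T, H, W):
    """Count query-key pairs for joint vs. axial attention over a clip."""
    full = (T * H * W) ** 2
    axial = W * (T * H) ** 2 + H * (T * W) ** 2
    return full, axial
```

Under these assumptions, a clip of 2 frames at 32 x 32 resolution needs (2·32·32)² query-key pairs when all pixels attend jointly, but only 32·(2·32)² + 32·(2·32)² pairs with the axial decomposition, which illustrates why splitting the computation along the height- and width-axes makes dense pixel features tractable.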


URL

https://arxiv.org/abs/2311.18537

PDF

https://arxiv.org/pdf/2311.18537.pdf

