MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation


Abstract

Video panoptic segmentation requires consistently segmenting (for both 'thing' and 'stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that exploits Mask XFormer with Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer by leveraging trajectory attention. The deployed mask transformer takes as input a short clip consisting of only a few frames and predicts the clip-level segmentation. To enhance temporal consistency, MaXTron employs within-clip and cross-clip tracking modules that make efficient use of trajectory attention. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, it is nontrivial to directly extend trajectory attention to per-pixel dense prediction tasks due to its quadratic dependency on input size. To alleviate the issue, we propose to adapt trajectory attention for both the dense pixel features and the object queries, aiming to improve the short-term and long-term tracking results, respectively. In particular, in our within-clip tracking module, we propose axial-trajectory attention, which computes the trajectory attention for tracking dense pixels sequentially along the height and width axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, since the object queries in the mask transformer are learned to encode the object information, we are able to capture the long-term temporal connections by applying trajectory attention to the object queries, which learns to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performance on video segmentation benchmarks.
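The complexity claim for the axial decomposition can be illustrated with a rough count of query-key pairs. The sketch below is not the authors' implementation. It assumes, as one reading of the abstract, that the height-axis pass lets tokens in the same column (across all frames of the clip) attend to one another, and that the width-axis pass does the same for rows. The function names and the clip size are hypothetical, and the pair count is only a proxy for the real compute and memory cost.

```python
def full_attention_pairs(t, h, w):
    """Query-key pairs when every space-time token attends to every other token."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t, h, w):
    """Query-key pairs for a height-axis pass followed by a width-axis pass.

    Assumption (not confirmed by the source): in the height pass, each of the
    w columns holds t*h tokens that attend to one another; in the width pass,
    each of the h rows holds t*w tokens that attend to one another.
    """
    height_pass = w * (t * h) ** 2
    width_pass = h * (t * w) ** 2
    return height_pass + width_pass


if __name__ == "__main__":
    # Hypothetical clip: 2 frames with a 32x32 feature map.
    t, h, w = 2, 32, 32
    full = full_attention_pairs(t, h, w)
    axial = axial_attention_pairs(t, h, w)
    print(f"full: {full}, axial: {axial}, ratio: {full / axial:.1f}")
```

Under these assumptions, the full count grows with (T·H·W)^2, while the axial count grows with T^2·H·W·(H+W), so the gap widens as the feature map gets larger.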

Abstract (translated)

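Video panoptic segmentation requires consistently segmenting (for both 'thing' and 'stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that uses Mask XFormer and trajectory attention to tackle the task. MaXTron enriches a standard mask transformer with trajectory attention. The deployed mask transformer takes a short clip of a few frames as input and predicts the clip-level segmentation. To enhance temporal consistency, MaXTron employs within-clip and cross-clip tracking modules that make efficient use of trajectory attention. Originally designed for video classification, trajectory attention learns to model temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, because of its quadratic dependency on input size, extending trajectory attention directly to per-pixel dense prediction tasks is not easy. To alleviate this problem, we propose to adapt trajectory attention, aiming to improve both short-term and long-term tracking results. In particular, in our within-clip tracking module, we propose axial-trajectory attention, which computes the trajectory attention for tracking dense pixels along the height and width axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, because the mask transformer learns to encode object information, we can track objects by applying trajectory attention, which learns to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performance on video segmentation benchmarks.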

URL

https://arxiv.org/abs/2311.18537

PDF

https://arxiv.org/pdf/2311.18537.pdf
