Abstract
Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context from the interaction between expressions and videos, so that the dynamic attributes of each object can be captured. Previous works either apply attention across all frames or stack dense local attention to obtain a global view of temporal context. However, they fail to strike a good balance between locality and globality, and their computational complexity grows significantly with video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism that aggregates global context information into object features. Specifically, we aggregate the global context information in two ways. First, we stack sparse local attention layers to balance locality and globality: we design a dilated window attention across frames to aggregate local context, and we apply this attention over a stack of layers to obtain a global view. In addition, each query attends to a small group of keys randomly selected from a global pool, which further enhances globality. Second, we design a global query that interacts with all the other queries to directly encode the global context information. Experiments show that our method achieves new state-of-the-art results on four referring video object segmentation benchmarks. Notably, our method shows improvements of 11.3% and 8.3% on the MeViS val^u and val sets, respectively.
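The sparse attention pattern described above can be illustrated with a small mask-building sketch. This is not the authors' implementation: it assumes one object query per frame, and the function name `ltca_style_mask` and the parameters `window`, `dilation`, and `num_random` are hypothetical choices, since the abstract does not give concrete values. It only shows how a dilated local window, a few randomly chosen keys, and a single global query might combine into one attention mask.

```python
import random


def ltca_style_mask(num_frames, window=3, dilation=2, num_random=2, seed=0):
    """Build a (T+1) x (T+1) boolean attention mask (illustrative only).

    mask[i][j] is True when query i may attend to key j.
    Indices 0..T-1 are per-frame queries; index T is the global query.
    All parameter values are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    T = num_frames
    mask = [[False] * (T + 1) for _ in range(T + 1)]
    half = window // 2

    for i in range(T):
        # 1) Dilated local window: neighbours spaced `dilation` frames apart.
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < T:
                mask[i][j] = True

        # 2) A few keys sampled at random from the remaining (global) pool.
        pool = [j for j in range(T) if not mask[i][j]]
        for j in rng.sample(pool, min(num_random, len(pool))):
            mask[i][j] = True

        # 3) The global query exchanges information with every frame query.
        mask[i][T] = True
        mask[T][i] = True

    mask[T][T] = True
    return mask


if __name__ == "__main__":
    m = ltca_style_mask(num_frames=16)
    per_query = [sum(row) for row in m[:-1]]
    print("keys attended per frame query:", per_query)
    print("keys under full attention:", len(m))
```

Under these assumptions each frame query attends to at most `window + num_random + 1` keys regardless of video length, whereas full attention across frames attends to all of them; stacking several such layers is what lets information propagate beyond the local window.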
URL
https://arxiv.org/abs/2510.08305