Paper Reading AI Learner

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

2023-05-25 17:59:47
Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Zhongjiang He, Peng Gao

Abstract

Recently, video object segmentation (VOS) referred by multi-modal signals, e.g., language and audio, has evoked increasing attention in both industry and academia. It is challenging for exploring the semantic alignment within modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities, and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +4.2% J&F improvements to state-of-the-art methods, demonstrating our significance for unified multi-modal VOS. Code is released at this https URL.

Abstract (translated)

最近,视频对象分割(VOS)由多种modal信号,如语言和音频,引起 industry 和学术界的广泛关注。探索modal之间语义对齐以及不同帧之间的视觉对应是挑战性的任务。然而,现有方法采用不同的modal网络架构,并忽略了帧间间的时间交互。在本文中,我们提出了MUTR,一个多模态统一时间Transformer,以 refering 视频对象分割。第一次使用统一框架,MUTR采用DETR风格的Transformer,能够以文本或音频参考分别分割指定视频对象。具体来说,我们介绍了两种策略,以 fully 探索视频和modal信号之间的时间关系。首先,在Transformer之前进行低级别时间聚合,我们使多模态参考能够从连续的视频帧捕获多尺度的视觉线索。这 effectively赋予文本或音频信号时间知识,并增强modal之间的语义对齐。其次,在Transformer之后进行高级别时间交互,我们进行不同物体嵌入之间的帧间特征通信,有助于更好地跟踪视频对象。在ref-YouTube-VOS和AVSbench数据集,以相应的文本和音频参考,MUTR取得了与当前方法相比 +4.2% 和 +4.2%的J&F改进,这表明我们对于统一多模态VOS的重要性。代码在此https URL发布。

URL

https://arxiv.org/abs/2305.16318

PDF

https://arxiv.org/pdf/2305.16318.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot