Paper Reading AI Learner

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

2026-01-13 01:53:20
Peng Gao, Yujian Lee, Yongqi Xu, Wentao Fan

Abstract

Audio-visual semantic segmentation (AVSS) extends the audio-visual segmentation (AVS) task, requiring a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, first producing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach builds on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. Because sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment (VTA) module to facilitate cross-modal integration, yielding more coherent and contextually relevant semantic interpretations. Our training regimen includes a post-mask technique that compels the model to learn the structure of the optical-flow map. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.
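As an illustration only (not the authors' code), the pre-mask idea of gating the visual input by motion can be sketched as follows. Here simple frame differencing stands in for a dense optical-flow estimator (the paper uses true optical flow); the function name and threshold are hypothetical.

```python
import numpy as np

def motion_pre_mask(prev_frame, next_frame, threshold=0.1):
    """Binary pre-mask highlighting moving (likely sound-emitting) regions.

    Frame differencing is a cheap stand-in for a dense optical-flow
    estimator; a real pipeline would threshold the flow magnitude instead.
    """
    # Per-pixel motion magnitude between consecutive frames.
    motion = np.abs(next_frame.astype(np.float32) - prev_frame.astype(np.float32))
    if motion.ndim == 3:  # average over colour channels if present
        motion = motion.mean(axis=-1)
    # Normalise to [0, 1] and threshold into a binary pre-mask.
    motion /= max(float(motion.max()), 1e-8)
    return (motion > threshold).astype(np.float32)

# Toy example: a bright 3x3 square moves two pixels to the right.
prev = np.zeros((8, 8), dtype=np.float32)
nxt = np.zeros((8, 8), dtype=np.float32)
prev[2:5, 2:5] = 1.0
nxt[2:5, 4:7] = 1.0
mask = motion_pre_mask(prev, nxt)
```

In the full framework, such a motion mask would only supply temporal context for moving sound sources; stationary emitters are handled by the textual prompts and the VTA module described above.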

URL

https://arxiv.org/abs/2601.08133

PDF

https://arxiv.org/pdf/2601.08133.pdf

