DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding

2025-06-29 11:50:19
Mona Ahmadian, Amir Shirian, Frank Guerin, Andrew Gilbert

Abstract

Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets, surpassing previous approaches with average mAP gains of +3.3% on UnAV-100, +2.6% on THUMOS14, +1.2% on ActivityNet 1.3, and +1.7% (verb) and +1.4% (noun) on EPIC-Kitchens-100.
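
For readers unfamiliar with how the reported average mAP is typically scored in TAL, the following is a minimal sketch of temporal IoU (tIoU), the overlap measure used to decide whether a predicted segment matches a ground-truth segment. It is not taken from the DEL paper or its code; the function names `temporal_iou` and `matches_at_thresholds` and the threshold values are assumptions chosen for illustration, and each benchmark defines its own threshold set.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def matches_at_thresholds(pred, gt, thresholds=(0.5, 0.75, 0.95)):
    """Report, per tIoU threshold, whether pred counts as a match for gt.

    The default thresholds are illustrative assumptions, not the ones
    used for any specific dataset mentioned above.
    """
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


if __name__ == "__main__":
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(matches_at_thresholds((0.0, 10.0), (2.0, 10.0)))
```

Average mAP is then usually obtained by computing mAP at each threshold and averaging the results, so a method must localize boundaries tightly, not just detect the right class, to score well.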

URL

https://arxiv.org/abs/2506.23196

PDF

https://arxiv.org/pdf/2506.23196.pdf

