Paper Reading AI Learner

FocusedAD: Character-centric Movie Audio Description

2025-04-16 15:04:14
Xiaojun Ye, Chun Wang, Yiren Song, Sheng Zhou, Liangcheng Li, Jiajun Bu

Abstract

Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at this https URL.

Abstract (translated)

Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, helping blind and visually impaired (BVI) audiences follow what is on screen. Compared with general video captioning, AD must provide plot-relevant narration that explicitly names characters, which poses unique challenges for movie understanding. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. The framework consists of: (i) a Character Perception Module (CPM) that tracks character regions on screen and links them to their names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including zero-shot results on MAD-eval-Named and the newly proposed Cinepile-AD dataset. Code and data will be released at this https URL.
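
The abstract describes a three-stage, character-centric pipeline (CPM → DPM → FCM) driven by a character query bank. The minimal Python sketch below illustrates that data flow only; every class, method, and field name is a hypothetical placeholder inferred from the abstract, not the authors' released code or API.

```python
# Illustrative sketch of the FocusedAD data flow as described in the abstract.
# All names below are hypothetical placeholders, not the authors' actual API.
from dataclasses import dataclass


@dataclass
class MovieClip:
    frames: list        # sampled frames from a dialogue-free segment
    subtitles: list     # nearby subtitle lines
    prior_ads: list     # previously generated audio descriptions (context)


class FocusedADPipeline:
    def __init__(self, character_query_bank):
        # Maps character names to reference portrait/region embeddings,
        # built offline by the automated query-bank pipeline the paper mentions.
        self.query_bank = character_query_bank

    def character_perception(self, clip):
        """CPM: track character regions and match them to names in the query bank."""
        # e.g. detect and track person regions, compare their embeddings against
        # the bank, and return {name: region_features} for active main characters.
        raise NotImplementedError

    def dynamic_prior(self, clip):
        """DPM: encode prior ADs and subtitles into learnable soft prompts."""
        raise NotImplementedError

    def focused_caption(self, clip, characters, soft_prompts):
        """FCM: generate a plot-relevant narration that names the active characters."""
        raise NotImplementedError

    def describe(self, clip):
        characters = self.character_perception(clip)   # who is on screen, by name
        soft_prompts = self.dynamic_prior(clip)        # story context as soft prompts
        return self.focused_caption(clip, characters, soft_prompts)
```

Framing the modules this way keeps character naming (CPM), story context (DPM), and narration generation (FCM) separable, which mirrors how the abstract factors the problem; the actual released implementation may differ.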

URL

https://arxiv.org/abs/2504.12157

PDF

https://arxiv.org/pdf/2504.12157.pdf

