Integrating both Visual and Audio Cues for Enhanced Video Caption

2017-12-09 04:03:21
Wangli Hao, Zhaoxiang Zhang, He Guan, Guibo Zhu

Abstract

Video captioning refers to automatically generating a descriptive sentence for a short video clip, a task that has achieved remarkable success recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first explores cross-modality feature fusion from low to high order. The second establishes short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends this temporal dependency to the long term by sharing a multimodal memory across the visual and audio modalities. Extensive experiments have validated the effectiveness of our three cross-modality fusion strategies on two benchmark datasets: Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). Notably, weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework to handle the case where some modalities are missing. Experimental results demonstrate that even when the audio modality is absent, we can still obtain comparable results with the aid of an additional audio-modality inference module.
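To make the second strategy concrete, below is a minimal PyTorch sketch of weight-shared front-end fusion as the abstract describes it: visual and audio feature sequences pass through a single shared encoder, and the time-aligned outputs are fused before caption decoding. The module name, projection layers, feature dimensions, and concatenation fusion here are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Fuse per-frame visual and audio features through one shared encoder.

    Sharing the front-end weights pushes both modalities into a common
    embedding space, which the abstract credits for coordinating
    visual-audio feature fusion (hypothetical reconstruction).
    """
    def __init__(self, vis_dim=2048, aud_dim=128, hidden=512):
        super().__init__()
        # Modality-specific projections into a common input size (assumed).
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        # One GRU used by both modalities: the weight-shared front end.
        self.shared_rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, T, vis_dim); aud_feats: (batch, T, aud_dim)
        v, _ = self.shared_rnn(self.vis_proj(vis_feats))
        a, _ = self.shared_rnn(self.aud_proj(aud_feats))
        # Simple concatenation fusion of the time-aligned encodings;
        # the paper also studies low- to high-order fusion variants.
        return torch.cat([v, a], dim=-1)  # (batch, T, 2 * hidden)

# Example: fuse 30 frames of visual CNN features with audio features.
fused = SharedWeightFusion()(torch.randn(2, 30, 2048), torch.randn(2, 30, 128))
print(fused.shape)  # torch.Size([2, 30, 1024])

The fused sequence would then feed a caption decoder; the abstract's third strategy replaces the per-modality recurrent state with a memory shared across both streams to capture long-term dependency.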

URL

https://arxiv.org/abs/1711.08097

PDF

https://arxiv.org/pdf/1711.08097.pdf
