Integrating both Visual and Audio Cues for Enhanced Video Caption

2017-12-09 04:03:21
Wangli Hao, Zhaoxiang Zhang, He Guan, Guibo Zhu

Abstract

Video captioning refers to automatically generating a descriptive sentence for a short video clip, a task that has achieved remarkable success recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first explores cross-modality feature fusion from low to high order. The second establishes short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends this temporal dependency to the long term by sharing a multimodal memory across the visual and audio modalities. Extensive experiments have validated the effectiveness of our three cross-modality fusion strategies on two benchmark datasets: Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). Notably, weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework to handle the case where some modalities are missing. Experimental results demonstrate that even when the audio modality is absent, we can still obtain comparable results with the aid of an additional audio-modality inference module.
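To make the second strategy concrete, below is a minimal PyTorch sketch of weight-shared front-end fusion as the abstract describes it: visual and audio feature sequences pass through a single shared encoder, and the time-aligned outputs are fused before caption decoding. The module name, projection layers, feature dimensions, and concatenation fusion here are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Fuse per-frame visual and audio features through one shared encoder.

    Sharing the front-end weights pushes both modalities into a common
    embedding space, which the abstract credits for coordinating
    visual-audio feature fusion (hypothetical reconstruction).
    """
    def __init__(self, vis_dim=2048, aud_dim=128, hidden=512):
        super().__init__()
        # Modality-specific projections into a common input size (assumed).
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        # One GRU used by both modalities: the weight-shared front end.
        self.shared_rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, T, vis_dim); aud_feats: (batch, T, aud_dim)
        v, _ = self.shared_rnn(self.vis_proj(vis_feats))
        a, _ = self.shared_rnn(self.aud_proj(aud_feats))
        # Simple concatenation fusion of the time-aligned encodings;
        # the paper also studies low- to high-order fusion variants.
        return torch.cat([v, a], dim=-1)  # (batch, T, 2 * hidden)

# Example: fuse 30 frames of visual CNN features with audio features.
fused = SharedWeightFusion()(torch.randn(2, 30, 2048), torch.randn(2, 30, 128))
print(fused.shape)  # torch.Size([2, 30, 1024])

The fused sequence would then feed a caption decoder; the abstract's third strategy replaces the per-modality recurrent state with a memory shared across both streams to capture long-term dependency.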

URL

https://arxiv.org/abs/1711.08097

PDF

https://arxiv.org/pdf/1711.08097.pdf
