Abstract
Inspired by the fact that different modalities in videos carry complementary information, we propose the Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating attribute prediction as a multi-label classification problem. We further add an auxiliary classification loss so that the model learns more effective visual features and high-level multimodal semantic attribute distributions, yielding a richer video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices and employ an attention mechanism to attend to different attributes at each step of the captioning process. We evaluate our method on two popular public benchmarks, MSVD and MSR-VTT, and achieve results competitive with the current state of the art across six evaluation metrics.
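The decoding-phase idea above, replacing each LSTM weight matrix with an attention-weighted ensemble of attribute-dependent matrices, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the number of attributes `K`, the dimensions, and the function names are assumptions.

```python
import numpy as np

# Hypothetical sketch of an attribute-dependent weight ensemble.
# K attribute-specific matrices stand in for a single LSTM gate matrix;
# per-step attention over attributes selects their convex combination.
rng = np.random.default_rng(0)
K, d_in, d_h = 5, 16, 32            # assumed: 5 attributes, toy sizes

# One weight matrix per semantic attribute (the "ensemble").
W = rng.standard_normal((K, d_h, d_in))

def attended_weight(logits):
    """Mix the K attribute-dependent matrices with softmax attention."""
    a = np.exp(logits - logits.max())
    a /= a.sum()                     # attention distribution over attributes
    return np.einsum('k,kij->ij', a, W)  # weighted sum of matrices

x = rng.standard_normal(d_in)        # input to the gate at one time step
logits = rng.standard_normal(K)      # per-step attribute attention scores
h_pre = attended_weight(logits) @ x  # pre-activation for one LSTM gate
print(h_pre.shape)                   # (32,)
```

In a full decoder, the same mixing would be applied to every gate's input-to-hidden and hidden-to-hidden matrices, with the attention logits computed from the current hidden state and the detected attribute distribution rather than sampled at random.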
URL
https://arxiv.org/abs/1905.02963