Generating Video Descriptions with Topic Guidance

2017-09-04 11:38:38
Shizhe Chen, Jia Chen, Qin Jin

Abstract

Generating video descriptions in natural language (a.k.a. video captioning) is more challenging than image captioning because videos are intrinsically more complicated than images in two respects. First, videos cover a broader range of topics, such as news, music, and sports. Second, multiple topics can coexist in the same video. In this paper, we propose a novel caption model, the topic-guided model (TGM), which exploits topic information to generate topic-oriented descriptions for videos in the wild. In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from the training captions with an unsupervised topic mining model. We show that the data-driven topics reflect a better topic schema than the predefined topics. To predict topics for test videos, we treat the topic mining model as a teacher that trains a student, the topic prediction model, which uses all modalities in the video, especially speech. We propose a series of caption models to exploit topic guidance: implicitly, by using the topics as input features to generate topic-related words, and explicitly, by modifying the decoder weights with topics so that the decoder functions as an ensemble of topic-aware language decoders. Comprehensive experiments on MSR-VTT, the largest video caption dataset at the time of writing, demonstrate the effectiveness of our topic-guided model, which significantly surpasses the winning performance in the 2016 MSR Video to Language Challenge.
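
The idea of modifying decoder weights with topics so that the decoder acts as an ensemble can be illustrated with a small sketch. This is only one simple reading of that sentence, not necessarily the formulation used in the paper: it assumes one weight matrix per topic and blends them with the predicted topic distribution. The names mix_weights and project, the two example topics, and all numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mix_weights(topic_probs: List[float], topic_weights: List[Matrix]) -> Matrix:
    """Blend per-topic weight matrices using the topic distribution.

    Assumption for illustration: every topic owns a full weight matrix
    of the same shape, and the blended matrix is their weighted sum.
    """
    if len(topic_probs) != len(topic_weights):
        raise ValueError("need exactly one weight matrix per topic")
    rows, cols = len(topic_weights[0]), len(topic_weights[0][0])
    mixed = [[0.0] * cols for _ in range(rows)]
    for prob, weights in zip(topic_probs, topic_weights):
        for i in range(rows):
            for j in range(cols):
                mixed[i][j] += prob * weights[i][j]
    return mixed


def project(weights: Matrix, hidden: List[float]) -> List[float]:
    """Apply a weight matrix to a hidden-state vector (matrix-vector product)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


# Hypothetical example: a video judged 75% "sports" and 25% "music".
topic_probs = [0.75, 0.25]
w_sports = [[1.0, 0.0], [0.0, 1.0]]
w_music = [[0.0, 2.0], [2.0, 0.0]]
hidden = [1.0, 2.0]

mixed = mix_weights(topic_probs, [w_sports, w_music])
mixed_output = project(mixed, hidden)

# The same result obtained as a weighted ensemble of per-topic outputs.
ensemble_output = [
    topic_probs[0] * s + topic_probs[1] * m
    for s, m in zip(project(w_sports, hidden), project(w_music, hidden))
]
```

Because the projection is linear, blending the weights first gives the same output as running each topic-specific projection separately and averaging the results with the topic probabilities, which is why a single topic-modulated decoder can be described as an ensemble of topic-aware decoders.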

URL

https://arxiv.org/abs/1708.09666

PDF

https://arxiv.org/pdf/1708.09666.pdf

