Dual Dense Encoding for Zero-Example Video Retrieval

2018-09-17 13:13:14
Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Xun Wang

Abstract

This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text, with no visual example provided. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper follows a novel trend of concept-free, deep learning based encoding. To that end, we propose a dual deep encoding network that works on both the video and query sides. The network can be flexibly coupled with an existing common space learning module for video-text similarity computation. Experiments on three benchmarks, i.e., MSR-VTT and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks, show that the proposed method establishes a new state of the art for zero-example video retrieval.
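
The retrieval step that the abstract describes (scoring videos against a text query in a common space) can be illustrated with a minimal sketch. It assumes the query and the videos have already been encoded into vectors of equal length and that cosine similarity is the scoring function; the abstract specifies neither, and the names `cosine`, `rank_videos`, and the toy vectors are hypothetical, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by descending similarity to the query.

    query_vec  -- common-space embedding of the text query (assumed given)
    video_vecs -- dict mapping video id to its common-space embedding
    """
    scores = {vid: cosine(query_vec, vec) for vid, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings standing in for the outputs of the two encoders.
query = [1.0, 0.0, 1.0]
videos = {
    "video_a": [0.9, 0.1, 1.1],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [1.0, 1.0, 0.0],
}
ranking = rank_videos(query, videos)
print(ranking)
```

Because no visual example is supplied at query time, ranking of this kind relies entirely on how well the two encoders align text and video in the shared space.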

URL

https://arxiv.org/abs/1809.06181

PDF

https://arxiv.org/pdf/1809.06181.pdf
