Abstract
We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools, and techniques. Validating on a ground truth annotated dataset of 16K samples, we show our model's significant improvements over best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
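To make the retrieval setting concrete, the following is a minimal sketch of ranking candidate segments against a video-and-text conditioned query and scoring the result with recall@k. It is not the VidDetours model: the fusion by averaging, the toy embeddings, and all function names (`fuse_query`, `rank_segments`, `recall_at_k`) are assumptions made purely for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_query(video_emb, text_emb):
    """Hypothetical fusion: average the unit-normalized video and text embeddings.

    A learned model would replace this step; averaging is only a stand-in.
    """
    v, t = normalize(video_emb), normalize(text_emb)
    return normalize([(a + b) / 2 for a, b in zip(v, t)])


def rank_segments(query, segments):
    """Return segment ids sorted by cosine similarity to the query, best first."""
    scores = {
        seg_id: sum(q * s for q, s in zip(query, normalize(emb)))
        for seg_id, emb in segments.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose ground-truth segment appears in the top k."""
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth) if gt in ranked[:k])
    return hits / len(ground_truth)


# Toy example with made-up 3-dimensional embeddings.
segments = {
    "seg_a": [1.0, 0.0, 0.0],
    "seg_b": [0.0, 1.0, 0.0],
    "seg_c": [0.7, 0.7, 0.0],
}
query = fuse_query(video_emb=[1.0, 0.0, 0.0], text_emb=[0.0, 1.0, 0.0])
ranked = rank_segments(query, segments)
print(ranked[0], recall_at_k([ranked], ["seg_c"], k=1))
```

In this toy setup the segment whose embedding lies between the video and text embeddings ranks first, which mirrors the idea that a detour must match both the source video's context and the requested alteration.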
URL
https://arxiv.org/abs/2401.01823