Abstract
Most natural videos contain numerous events. For example, a video of a "man playing a piano" might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that identifies all events in a single pass over the video while simultaneously describing the detected events in natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short events and long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its own start and end time. Finally, we report the performance of our model on dense-captioning events, video retrieval, and localization.
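The two-stage structure the abstract describes (a single-pass proposal module that covers multiple temporal scales, followed by a captioning module that conditions on past and future events) can be sketched as follows. This is an illustrative toy, not the paper's actual architecture: the real model uses learned video features, a DAPs-style proposal network, and an LSTM captioner, whereas here the features, strides, and caption strings are placeholder assumptions.

```python
# Toy sketch of a dense-captioning-events pipeline (illustrative only;
# module logic and names are assumptions, not the paper's implementation).

def propose_events(features, strides=(1, 2, 4, 8)):
    """Single pass over the feature sequence: emit candidate (start, end)
    segments at multiple temporal strides, so both short and very long
    events can be captured."""
    proposals = []
    n = len(features)
    for s in strides:
        for end in range(s, n + 1, s):
            proposals.append((end - s, end))
    return proposals

def caption_event(features, segment, context_segments):
    """Toy captioner: describes a segment using its own features plus
    context from the other (past and future) events -- here reduced to
    a mean feature value and a context count."""
    start, end = segment
    mean_feat = sum(features[start:end]) / (end - start)
    return f"event[{start},{end}) feat={mean_feat:.1f} ctx={len(context_segments)}"

def dense_caption(features):
    """Jointly describe all proposed events, each with its own
    start time, end time, and caption."""
    proposals = propose_events(features)
    return [
        (seg, caption_event(features, seg, [p for p in proposals if p != seg]))
        for seg in proposals
    ]
```

For an 8-step feature sequence, the strides above yield 15 overlapping proposals (8 + 4 + 2 + 1), each captioned with awareness of the other 14 events; the real model would instead score and rank proposals and decode captions with a language model.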
URL
https://arxiv.org/abs/1705.00754