Abstract
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., those generating pseudo-temporal boundaries for training, have achieved great success. However, the data augmentations in these methods may disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information to expand the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), which first uses the additional information to expand the initial incomplete pseudo boundaries and then refines the expanded ones into precise boundaries. Motivated by video continuity, i.e., the visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within the initial pseudo boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further suppress the noise in the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, learning to balance the incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries to obtain more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
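The abstract mentions a proposal-level contrastive objective but gives no implementation details. As a rough illustration only, a generic InfoNCE-style loss over proposal embeddings might look as follows; the function name, the anchor/positive/negative partition (e.g., query embedding vs. proposals inside/outside a pseudo boundary), and the temperature are all hypothetical assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Hypothetical InfoNCE-style loss over proposal embeddings.

    anchor:    (D,)   embedding of the query (e.g., the sentence).
    positives: (P, D) embeddings of proposals inside a pseudo boundary.
    negatives: (N, D) embeddings of proposals outside it.
    """
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = positives @ anchor / temperature  # (P,) similarities to positives
    neg_sim = negatives @ anchor / temperature  # (N,) similarities to negatives
    # Each positive is contrasted against all negatives; the positive logit
    # sits in column 0, so the target class index is 0 for every row.
    logits = torch.cat(
        [pos_sim.unsqueeze(1),
         neg_sim.unsqueeze(0).expand(pos_sim.size(0), -1)],
        dim=1,
    )
    labels = torch.zeros(pos_sim.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

In the paper's setting one would presumably compute such a loss twice, once against the initial (clean) boundary and once against the expanded (noisy) one, and let the mutual-learning scheme weigh the two; that weighting is not specified in the abstract.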
URL
https://arxiv.org/abs/2312.02483