The naturalistic driving action localization task aims to recognize and comprehend human behaviors and actions from video data captured in real-world driving scenarios. Previous studies have shown strong action localization performance by applying a recognition model followed by probability-based post-processing. Nevertheless, the probabilities provided by the recognition model frequently contain confusing information, which poses challenges for post-processing. In this work, we adopt an action recognition model based on self-supervised learning to detect distracted activities and provide candidate action probabilities. Subsequently, a constraint ensemble strategy takes advantage of multi-camera views to produce robust predictions. Finally, we introduce a conditional post-processing operation to precisely locate distracted behaviors and their temporal boundaries. On test set A2, our method obtains sixth place on the public leaderboard of Track 3 of the 2024 AI City Challenge.
https://arxiv.org/abs/2411.12525
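To make the multi-view ensembling and conditional post-processing described above concrete, here is a minimal sketch assuming frame-level class probabilities are available from each camera view; the simple averaging, the confidence threshold, and the segment-merging rule are illustrative assumptions, not the authors' exact constraint strategy.

```python
import numpy as np

def ensemble_multiview(probs_per_view, threshold=0.5):
    """Average per-frame class probabilities from several camera views,
    then keep only frames whose top class clears a confidence threshold.

    probs_per_view: list of arrays, each of shape (num_frames, num_classes).
    Returns per-frame predicted class ids (-1 where no class is confident).
    """
    avg = np.mean(np.stack(probs_per_view, axis=0), axis=0)  # (T, C)
    top_class = avg.argmax(axis=1)
    top_prob = avg.max(axis=1)
    return np.where(top_prob >= threshold, top_class, -1)

def frames_to_segments(preds, fps=30.0):
    """Conditional post-processing: merge runs of identical labels into
    (class, start_sec, end_sec) segments, skipping background (-1)."""
    segments, start = [], 0
    for t in range(1, len(preds) + 1):
        if t == len(preds) or preds[t] != preds[start]:
            if preds[start] != -1:
                segments.append((int(preds[start]), start / fps, t / fps))
            start = t
    return segments
```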
Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors to conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL enhances WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address the incomplete and over-complete results common in WTAL methods. Rigorous experiments validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.
https://arxiv.org/abs/2411.08466
For the temporal action localization task on the ActivityNet-1.3 dataset, we aim to locate the temporal boundaries of each action and predict its class in untrimmed videos. We first apply VideoSwinTransformer as a feature extractor to extract different features. We then apply a unified network following Faster-TAD to simultaneously obtain proposals and semantic labels. Finally, we ensemble the results of several complementary temporal action detection models. Faster-TAD simplifies the TAD pipeline and achieves remarkable performance, obtaining results comparable to those of multi-step approaches.
https://arxiv.org/abs/2411.00883
Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis. In many cases, however, large-scale datasets with video-label pairs for localization are unavailable, limiting the opportunity to fine-tune video-understanding models. Recent developments in large vision-language models (LVLMs) address this need with impressive zero-shot capabilities across a variety of video understanding tasks. However, the adaptation of image-based LVLMs, with their powerful visual question answering capabilities, to action localization in long-form video remains relatively unexplored. To this end, we introduce a true ZEro-shot Action Localization method (ZEAL). Specifically, we leverage the built-in action knowledge of a large language model (LLM) to inflate actions into highly detailed descriptions of the archetypal start and end of the action. These descriptions serve as queries to the LVLM for generating frame-level confidence scores, which can be aggregated to produce localization outputs. The simplicity and flexibility of our method make it amenable to more capable LVLMs as they are developed, and we demonstrate remarkable results in zero-shot action localization on a challenging benchmark, without any training.
https://arxiv.org/abs/2410.14340
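A rough illustration of the aggregation step described above: per-frame start/end confidence scores (e.g., obtained by querying an LVLM with the LLM-generated boundary descriptions) are combined into a single localization. The pairing rule and the `min_gap` parameter are assumptions for illustration, not ZEAL's actual procedure.

```python
import numpy as np

def localize_from_scores(start_scores, end_scores, min_gap=1):
    """Pair the most confident start frame with the most confident
    end frame occurring at least `min_gap` frames later."""
    start_scores = np.asarray(start_scores, dtype=float)
    end_scores = np.asarray(end_scores, dtype=float)
    s = int(start_scores.argmax())
    if s + min_gap >= len(end_scores):        # degenerate case: start too late
        return s, len(end_scores) - 1, 0.0
    e = s + min_gap + int(end_scores[s + min_gap:].argmax())
    return s, e, float(start_scores[s] * end_scores[e])
```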
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: this https URL.
https://arxiv.org/abs/2409.02385
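The abstract above only states that the consistency loss "enforces agreement in prediction between modalities"; one plausible instantiation, sketched here as an assumption rather than the paper's exact formulation, is a symmetric KL divergence between the class distributions predicted by two modality streams.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_rgb, logits_skel, temperature=1.0):
    """Symmetric KL divergence between class distributions predicted from
    two modalities (e.g. an RGB stream and a skeleton stream).
    logits_*: tensors of shape (batch, num_classes)."""
    p = F.log_softmax(logits_rgb / temperature, dim=-1)
    q = F.log_softmax(logits_skel / temperature, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)
```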
Video action localization aims to find the timings of a specific action in a long video. Although existing learning-based approaches have been successful, they require annotated videos, which come at considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored to finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess the frame considered closest to the start or end of the action. Iterating this process while narrowing the sampling time window localizes the specific start and end frames of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos.
https://arxiv.org/abs/2408.17422
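A minimal sketch of the iterative narrowing loop described above. The VLM call is stubbed out as `ask_vlm_for_closest_frame`, a hypothetical callback standing in for tiling the sampled frames into one index-labeled image and asking the VLM which index looks closest to the boundary; the number of samples and rounds are illustrative.

```python
def localize_boundary(num_frames, ask_vlm_for_closest_frame,
                      samples_per_round=8, rounds=4, target="start"):
    """Iteratively narrow a frame window around an action boundary.

    `ask_vlm_for_closest_frame(frame_indices, target)` must return one of
    the given indices; it abstracts the concatenated-image VLM query.
    """
    lo, hi = 0, num_frames - 1
    for _ in range(rounds):
        step = max((hi - lo) // (samples_per_round - 1), 1)
        indices = list(range(lo, hi + 1, step))[:samples_per_round]
        guess = ask_vlm_for_closest_frame(indices, target)
        # shrink the window to the neighborhood of the guessed frame
        lo = max(lo, guess - step)
        hi = min(hi, guess + step)
        if hi - lo <= 1:
            break
    return (lo + hi) // 2
```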
To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that are generalizable to detect and classify actions from unseen categories. They typically develop a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model to solve ZSTAL. However, these methods suffer from incomplete action proposals generated for unseen categories, since they follow a frame-level prediction paradigm and require hand-crafted post-processing to generate action proposals. To address this problem, in this work, we propose a novel model named Generalizable Action Proposal generator (GAP), which can interface seamlessly with CLIP and generate action proposals in a holistic way. Our GAP is built in a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate the hand-crafted post-processing. Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. Besides, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, which improves proposal completeness in a generalizable manner. Our experiments show that our GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, i.e., Thumos14 and ActivityNet1.3. Specifically, our model obtains significant performance improvement over previous works on the two benchmarks, i.e., +3.2% and +3.4% average mAP, respectively.
https://arxiv.org/abs/2408.13777
Existing few-shot temporal action localization models cannot handle videos that contain multiple action instances. The purpose of this paper is therefore to localize multiple action instances in a lengthy untrimmed query video using a limited number of trimmed support videos. To address this challenging problem effectively, we propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement. This method can accurately identify the start and end boundaries of actions in the query video using only a limited number of labeled videos. Our proposed method is adept at capturing both temporal and spatial contexts to effectively classify and precisely locate actions in videos, enabling a more comprehensive utilization of these crucial details. A selective cosine penalization algorithm is designed to suppress temporal boundaries that do not include action scene switches. Probability learning combined with a label generation algorithm alleviates the problem of action duration diversity and enhances the model's ability to handle fuzzy action boundaries. Interval clustering helps obtain the final results in multi-instance few-shot temporal action localization. Our model achieves competitive performance in meticulous experiments on the benchmark datasets ActivityNet1.3 and THUMOS14. Our code is readily available at this https URL.
https://arxiv.org/abs/2408.13765
Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements, such as Online Temporal Action Localization (OnTAL), extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) Framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of anchor features crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model outperforms state-of-the-art approaches significantly on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: this https URL
https://arxiv.org/abs/2408.06437
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at this https URL.
https://arxiv.org/abs/2408.05955
Online temporal action localization (On-TAL) is the task of identifying multiple action instances given a streaming video. Since existing methods take as input only a video segment of fixed size per iteration, they are limited in considering long-term context and require tuning the segment size carefully. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR utilizes a memory queue that selectively preserves past segment features, allowing it to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate the start time of the action. Our method outperformed existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
https://arxiv.org/abs/2408.02957
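A minimal sketch of a memory queue in the spirit of MATR's description above: past segment features are selectively kept and exposed so a decoder can estimate the start time of an ongoing action. The keep criterion used here (mean activation magnitude) and the capacity are placeholder assumptions; the paper's selection rule is not specified in the abstract.

```python
import torch
from collections import deque

class SegmentMemory:
    """Fixed-size memory of past segment features, selectively kept."""

    def __init__(self, capacity=64, min_score=0.0):
        self.queue = deque(maxlen=capacity)  # oldest entries drop automatically
        self.min_score = min_score

    def keep_score(self, seg_feat):
        # stand-in relevance criterion (an assumption for illustration)
        return seg_feat.abs().mean().item()

    def update(self, seg_feat):
        if self.keep_score(seg_feat) >= self.min_score:
            self.queue.append(seg_feat.detach())

    def as_tensor(self):
        """Stack stored features so a model can attend over them."""
        if not self.queue:
            return torch.empty(0)
        return torch.stack(list(self.queue), dim=0)
```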
In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation and quantification. Industry standards typically do not require time-interval annotations, even though they are more informative than time-point annotations for defect localization, resulting in additional annotation costs when fully supervised methods are used. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO fully leverages unlabeled data through unsupervised pretext tasks and utilizes time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1-0.7, improving by 8.14% over current state-of-the-art methods.
https://arxiv.org/abs/2407.15170
Temporal Action Localization (TAL) is a critical task in video analysis, identifying precise start and end times of actions. Existing methods like CNNs, RNNs, GCNs, and Transformers have limitations in capturing long-range dependencies and temporal causality. To address these challenges, we propose a novel TAL architecture leveraging the Selective State Space Model (S6). Our approach integrates the Feature Aggregated Bi-S6 block, Dual Bi-S6 structure, and a recurrent mechanism to enhance temporal and channel-wise dependency modeling without increasing parameter complexity. Extensive experiments on benchmark datasets demonstrate state-of-the-art results with mAP scores of 74.2% on THUMOS-14, 42.9% on ActivityNet, 29.6% on FineAction, and 45.8% on HACS. Ablation studies validate our method's effectiveness, showing that the Dual structure in the Stem module and the recurrent mechanism outperform traditional approaches. Our findings demonstrate the potential of S6-based models in TAL tasks, paving the way for future research.
https://arxiv.org/abs/2407.13078
Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes -- a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, limiting their flexibility. This paper introduces ActionSwitch, the first class-agnostic On-TAL framework capable of detecting overlapping actions. By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations, including overlapping actions of the same class or scenarios where class information is unavailable. This approach is complemented by the proposed "conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. Our ActionSwitch achieves state-of-the-art performance in complex datasets, including Epic-Kitchens 100 targeting the challenging egocentric view and FineAction consisting of fine-grained actions.
https://arxiv.org/abs/2407.12987
Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. The latest WSTAL methods introduce a pseudo-label learning framework to bridge the gap between classification-based training and localization-based inference targets, and achieve cutting-edge results. In these frameworks, a classification-based model is used to generate pseudo labels for a regression-based student model to learn from. However, the quality of the pseudo labels, a key factor in the final result, has not been carefully studied. In this paper, we propose a set of simple yet efficient pseudo-label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo-label quality at three stages: cross-video contrastive learning at the proposal generation stage, prior-based filtering at the proposal selection stage, and EMA-based distillation at the training stage. These designs enhance pseudo-label quality at different stages of the framework and help produce action proposals that are more informative, less prone to false positives, and smoother. With the help of these comprehensive designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2% and becoming the first method to reach the milestone of 50%.
https://arxiv.org/abs/2407.08971
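Of the three stages listed above, the EMA-based distillation step is the most self-contained to sketch. Below is a generic exponential-moving-average teacher update as commonly used for distillation; the decay value and which parameters are averaged are assumptions, not FuSTAL's specific settings.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Exponential-moving-average update of a teacher model's parameters
    from the student's, typically applied after each optimizer step."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# typical setup: the teacher starts as a frozen deep copy of the student
# (teacher = copy.deepcopy(student); teacher.eval()), and ema_update(teacher,
# student) is called once per training iteration.
```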
Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict conditions, but they typically assess classification and localization quality separately, leading to suboptimal pseudo-label ranking and selection. In particular, there might be inaccurate pseudo labels within selected positives, alongside reliable counterparts erroneously assigned to negatives. To tackle these problems, we propose a novel Adaptive Pseudo-label Learning (APL) framework to facilitate better pseudo-label selection. Specifically, to improve the ranking quality, Adaptive Label Quality Assessment (ALQA) is proposed to jointly learn classification confidence and localization reliability, followed by dynamically selecting pseudo labels based on the joint score. Additionally, we propose an Instance-level Consistency Discriminator (ICD) for eliminating ambiguous positives and mining potential positives simultaneously based on inter-instance intrinsic consistency, thereby leading to a more precise selection. We further introduce a general unsupervised Action-aware Contrastive Pre-training (ACP) to enhance the discrimination both within actions and between actions and backgrounds, which benefits SS-TAL. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate that our method achieves state-of-the-art performance under various semi-supervised settings.
https://arxiv.org/abs/2407.07673
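A simple sketch of the ranking-and-selection idea behind ALQA as described above: proposals are ranked by a joint classification/localization score and the top fraction is kept. The geometric-mean combination and the keep ratio are illustrative assumptions; in the paper the joint score is learned, not hand-crafted.

```python
import numpy as np

def select_pseudo_labels(cls_conf, loc_rel, keep_ratio=0.5):
    """Rank candidate pseudo labels by a joint score and keep the top ones.

    cls_conf: per-proposal classification confidence, shape (N,)
    loc_rel:  per-proposal localization reliability, shape (N,)
    Returns indices of the selected proposals.
    """
    joint = np.sqrt(np.asarray(cls_conf) * np.asarray(loc_rel))  # geometric mean
    k = max(1, int(len(joint) * keep_ratio))
    return np.argsort(-joint)[:k]
```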
The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TAL datasets for training an action localizer. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our self-training approach consists of two stages. First, a class-agnostic action localizer is trained on a human-labeled TAL dataset and used to generate pseudo-labels for unlabeled videos. Second, the large-scale pseudo-labeled dataset is combined with the human-labeled dataset to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we highlight issues with existing OV-TAL evaluation schemes and propose a new evaluation protocol. Code is released at this https URL
https://arxiv.org/abs/2407.07024
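A skeleton of the two-stage self-training recipe described above, with `train_fn` and `predict_fn` as hypothetical placeholders for the class-agnostic localizer's training and inference routines; the confidence threshold for keeping pseudo-labels is likewise an assumption.

```python
def self_train(labeled_set, unlabeled_videos, train_fn, predict_fn,
               score_threshold=0.7):
    """Two-stage self-training skeleton.

    train_fn(dataset) -> model; predict_fn(model, video) -> list of
    (start, end, score) proposals. Both are placeholders for whatever
    class-agnostic localizer is used.
    """
    # Stage 1: train on human labels, then pseudo-label unlabeled videos.
    localizer = train_fn(labeled_set)
    pseudo_set = []
    for video in unlabeled_videos:
        proposals = [p for p in predict_fn(localizer, video)
                     if p[2] >= score_threshold]  # keep confident proposals only
        if proposals:
            pseudo_set.append((video, proposals))

    # Stage 2: retrain on human labels plus the pseudo-labeled set.
    return train_fn(list(labeled_set) + pseudo_set)
```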
We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization, and harvest the predictions of atomic actions for the referred person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at this https URL.
https://arxiv.org/abs/2407.01872
Recent studies have shown promising results in utilizing multimodal large language models (MLLMs) for computer vision tasks such as object detection and semantic segmentation. However, many challenging video tasks remain under-explored. Video-language tasks necessitate spatial and temporal comprehension and require significant compute. Therefore, prior works have developed complex, highly specialized architectures or leveraged additional input signals such as video transcripts to best encode contextual and temporal information, which limits their generality and can be impractical. One particularly challenging task is video moment retrieval, which requires precise temporal and contextual grounding. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We introduce Mr. BLIP (Mr. as in Moment Retrieval), a multimodal, single-stage model that requires no expensive video-language pretraining, no additional input signal (e.g., no transcript or audio), and has a simpler and more versatile design than prior state-of-the-art methods. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions and illustrate our method's versatility with a new state-of-the-art in temporal action localization on ActivityNet. Notably, we attain over 9% (absolute) higher Recall (at 0.5 and 0.7 IoU) on the challenging long-video multi-moment QVHighlights benchmark. Our code is publicly available.
https://arxiv.org/abs/2406.18113
Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism to learn the alignment between class representations and frame-level video features, yielding multimodally guided features. Third, we propose a two-stage training strategy that includes training with a larger vocabulary dataset and fine-tuning on downstream data to generalize to novel categories. OVFormer extends existing TAL methods to open-vocabulary settings. Comprehensive evaluations on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our method. Code and pretrained models will be publicly released.
https://arxiv.org/abs/2406.15556
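A minimal sketch of the cross-attention step described above, assuming frame features attend over class-description embeddings to produce multimodally guided features; the dimensions, attention direction, and residual/normalization choices are assumptions rather than OVFormer's exact design.

```python
import torch
import torch.nn as nn

class ClassFrameCrossAttention(nn.Module):
    """Frame features (queries) attend over class-description embeddings
    (keys/values) derived from LLM-generated class descriptions."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, class_embeds):
        # frame_feats: (batch, T, dim); class_embeds: (batch, C, dim)
        attended, _ = self.attn(query=frame_feats,
                                key=class_embeds,
                                value=class_embeds)
        return self.norm(frame_feats + attended)  # residual connection + norm
```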