The crux of semi-supervised temporal action localization (SS-TAL) lies in excavating valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating for learning from non-target classes, transcending the conventional focus solely on the target class. The proposed approach involves partitioning the label space of the predicted class distribution into distinct subspaces: the target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class while excluding ambiguous classes. To this end, we first devise innovative strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.
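To make the selection and loss design concrete, here is a minimal PyTorch sketch of the idea, assuming softmax snippet probabilities from a base model; the thresholds (`pos_conf`, `neg_conf`, `rank_k`) and the exact loss forms are illustrative assumptions, not the paper's reported formulation.

```python
import torch

def partition_label_space(probs, pos_conf=0.5, neg_conf=0.05, rank_k=3):
    """Partition one snippet's class distribution (illustrative thresholds).

    probs: (C,) softmax probabilities from the base model. Classes are split
    relative to the target (arg-max) class: positives have high confidence
    and rank close to the target, negatives have near-zero confidence, and
    the remainder is ambiguous and excluded from training.
    """
    order = torch.argsort(probs, descending=True)
    target = order[0]
    rank = torch.empty_like(order)
    rank[order] = torch.arange(len(probs))           # rank of each class
    rel_conf = probs / probs[target]                 # confidence vs. target
    positive = (rel_conf > pos_conf) & (rank <= rank_k)
    positive[target] = False                         # target handled separately
    negative = probs < neg_conf
    ambiguous = ~(positive | negative)
    ambiguous[target] = False
    return target, positive, negative, ambiguous

def positive_negative_loss(logits, positive, negative, eps=1e-8):
    """Pull predictions toward positive classes and away from negatives."""
    p = torch.softmax(logits, dim=-1)
    pos = -torch.log(p[positive] + eps).mean() if positive.any() else p.new_zeros(())
    neg = -torch.log(1.0 - p[negative] + eps).mean() if negative.any() else p.new_zeros(())
    return pos + neg
```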
https://arxiv.org/abs/2403.11189
Skeleton-based motion representations are robust for action localization and understanding, owing to their invariance to perspective, lighting, and occlusion compared with images. Yet, they are often ambiguous and incomplete when taken out of context, even for human annotators. As infants discern gestures before associating them with words, actions can be conceptualized before being grounded with labels. Therefore, we propose the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), that partitions a skeleton-based motion sequence into discovered semantically meaningful pre-action segments. By fine-tuning our pre-training network with a small amount of annotated data, we show results outperforming SOTA methods by a large margin.
https://arxiv.org/abs/2403.07354
Temporal localization of driving actions plays a crucial role in advanced driver-assistance systems and naturalistic driving studies. However, this is a challenging task due to strict requirements for robustness, reliability, and accurate localization. In this work, we focus on improving the overall performance by efficiently utilizing video action recognition networks and adapting these to the problem of action localization. To this end, we first develop a density-guided label smoothing technique based on label probability distributions to facilitate better learning from boundary video-segments that typically include multiple labels. Second, we design a post-processing step to efficiently fuse information from video-segments and multiple camera views into scene-level predictions, which facilitates the elimination of false positives. Our methodology yields a competitive performance on the A2 test set of the naturalistic driving action recognition track of the 2022 NVIDIA AI City Challenge with an F1 score of 0.271.
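As a rough illustration of density-guided label smoothing, the sketch below builds a soft target from the empirical label density of frames inside a boundary segment; the blending weight `alpha` and the density recipe are assumptions of this sketch, not the paper's exact technique.

```python
import numpy as np

def density_guided_targets(frame_labels, num_classes, alpha=0.1):
    # Boundary segments contain frames of more than one class, so the soft
    # target is the empirical label density inside the segment, blended
    # with a small uniform smoothing mass `alpha`.
    counts = np.bincount(frame_labels, minlength=num_classes).astype(np.float64)
    density = counts / counts.sum()
    return (1.0 - alpha) * density + alpha / num_classes

# A boundary segment covering the transition from class 2 to class 5:
target = density_guided_targets(np.array([2, 2, 2, 5, 5]), num_classes=8)
```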
https://arxiv.org/abs/2403.06616
This work explores the performance of a large video understanding foundation model on the downstream task of human fall detection in untrimmed video and leverages a pretrained vision transformer for multi-class action detection, with the classes "Fall", "Lying" and "Other/Activities of daily living (ADL)". A method for temporal action localization that relies on a simple cut-up of untrimmed videos is demonstrated. The methodology includes a preprocessing pipeline that converts datasets with timestamp action annotations into labeled datasets of short action clips. Simple and effective clip-sampling strategies are introduced. The effectiveness of the proposed method has been empirically evaluated on the publicly available High-Quality Fall Simulation Dataset (HQFSD). The experimental results validate the performance of the proposed pipeline. The results are promising for real-time application, and falls are detected at the video level with a state-of-the-art F1 score of 0.96 on the HQFSD dataset under the given experimental settings. The source code will be made available on GitHub.
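A minimal sketch of such a cut-up preprocessing step, assuming `(start, end, label)` timestamp annotations and an overlap rule for assigning clip labels; the function name, thresholds, and the fallback class are illustrative, not taken from the paper.

```python
import numpy as np

def cut_into_clips(video_len_s, annotations, clip_len_s=2.0, stride_s=1.0,
                   overlap_thresh=0.5, default_label="ADL"):
    # annotations: list of (start_s, end_s, label) action intervals.
    # A clip inherits an action label when the action covers at least
    # `overlap_thresh` of the clip; otherwise it falls back to the
    # background class ("Other/ADL" in the paper's label set).
    clips = []
    for start in np.arange(0.0, video_len_s - clip_len_s + 1e-6, stride_s):
        end = start + clip_len_s
        label = default_label
        for (a_start, a_end, a_label) in annotations:
            overlap = max(0.0, min(end, a_end) - max(start, a_start))
            if overlap / clip_len_s >= overlap_thresh:
                label = a_label
                break
        clips.append((start, end, label))
    return clips

clips = cut_into_clips(30.0, [(10.2, 13.5, "Fall"), (13.5, 25.0, "Lying")])
```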
https://arxiv.org/abs/2401.16280
Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur significant complexity. In this paper, we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViT-S architecture trained with bipartite matching to perform both tasks surpasses the same MViT-S trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our MViTv2-S model achieves +3 mAP on AVA2.2 w.r.t. its two-stage counterpart. Code and models will be released after paper revision.
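A hedged sketch of a DETR-style bipartite matching loss applied directly to output tokens, using SciPy's Hungarian solver; the cost weights and loss terms are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_matching_loss(pred_boxes, pred_logits, gt_boxes, gt_labels,
                            w_box=5.0, w_cls=1.0):
    # pred_boxes: (T, 4) one box per output token; pred_logits: (T, C);
    # gt_boxes: (G, 4); gt_labels: (G,) long. Tokens are matched one-to-one
    # to ground truths by the Hungarian algorithm on a combined cost, then
    # the loss is computed over matched pairs only.
    prob = pred_logits.softmax(-1)                       # (T, C)
    cost_cls = -prob[:, gt_labels]                       # (T, G)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)    # (T, G) L1 distance
    cost = (w_cls * cost_cls + w_box * cost_box).detach().cpu().numpy()
    row, col = linear_sum_assignment(cost)
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    box_loss = torch.nn.functional.l1_loss(pred_boxes[row], gt_boxes[col])
    cls_loss = torch.nn.functional.cross_entropy(pred_logits[row], gt_labels[col])
    return w_box * box_loss + w_cls * cls_loss
```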
https://arxiv.org/abs/2312.17686
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at this https URL
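A minimal sketch of optimal-transport self-labeling via Sinkhorn iterations, assuming snippet-to-cluster similarity scores and a chosen prior over clusters; the temperature and iteration count are placeholder values.

```python
import torch

def sinkhorn_pseudo_labels(scores, prior, n_iters=50, eps=0.05):
    # scores: (N, K) snippet-to-cluster similarities; prior: (K,) desired
    # cluster marginal (one of the "plausible prior distributions").
    # Returns a soft assignment whose columns match `prior` and whose rows
    # are (approximately) uniform over the N snippets.
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    r = torch.full((scores.size(0),), 1.0 / scores.size(0))  # uniform rows
    for _ in range(n_iters):
        Q = Q * (r / Q.sum(dim=1)).unsqueeze(1)        # match row marginal
        Q = Q * (prior / Q.sum(dim=0)).unsqueeze(0)    # match column marginal
    return Q / Q.sum(dim=1, keepdim=True)              # per-snippet soft labels
```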
https://arxiv.org/abs/2312.14138
Temporal Action Localization (TAL) is a complex task that poses relevant challenges, particularly when attempting to generalize to new -- unseen -- domains in real-world applications. These scenarios, despite being realistic, are often neglected in the literature, exposing existing solutions to significant performance degradation. In this work, we tackle this issue by introducing, for the first time, an approach for Unsupervised Domain Adaptation (UDA) in sparse TAL, which we refer to as Semantic Adversarial unsupervised Domain Adaptation (SADA). Our contribution is threefold: (1) we pioneer the development of a domain adaptation model that operates on realistic sparse action detection benchmarks; (2) we tackle the limitations of global-distribution alignment techniques by introducing a novel adversarial loss that is sensitive to local class distributions, ensuring finer-grained adaptation; and (3) we present a novel experimental setup, based on EpicKitchens100, that evaluates multiple types of domain shifts in a comprehensive manner. Our experimental results indicate that SADA improves adaptation across domains when compared to fully supervised state-of-the-art and alternative UDA methods, attaining a relative performance boost of up to 14%.
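As a rough sketch of a class-sensitive adversarial loss, the code below routes features through per-class domain discriminators behind a gradient reversal layer; the discriminator architecture and routing by pseudo-class are assumptions of this illustration, not SADA's exact design.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, standard in adversarial domain adaptation."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class ClassWiseDomainLoss(nn.Module):
    """One small domain discriminator per action class: features are routed
    to the discriminator of their (pseudo-)class, so alignment is driven by
    local class distributions rather than the global feature distribution."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.discs = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
             for _ in range(num_classes)])
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, feats, pseudo_cls, is_target, lam=1.0):
        # feats: (N, D); pseudo_cls: (N,) predicted class per segment;
        # is_target: (N,) 1.0 for target-domain segments, 0.0 for source.
        losses = []
        for c, disc in enumerate(self.discs):
            mask = pseudo_cls == c
            if mask.any():
                x = GradReverse.apply(feats[mask], lam)
                losses.append(self.bce(disc(x).squeeze(-1), is_target[mask]))
        return torch.stack(losses).mean() if losses else feats.new_zeros(())
```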
https://arxiv.org/abs/2312.13377
Recently, temporal action localization (TAL) has garnered significant interest in the information retrieval community. However, existing supervised/weakly supervised methods are heavily dependent on extensive labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to obtain. Although some unsupervised methods have utilized the "iteratively clustering and localization" paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudolabels for model training. To address these limitations, we present a novel self-paced incremental learning model to enhance clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve the clustering confidence by exploring contextual feature-robust visual information. Thereafter, we design two (constant- and variable-speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of these video pseudolabels and further improving overall localization performance. Extensive experiments on two public datasets have substantiated the superiority of our model over several state-of-the-art competitors.
https://arxiv.org/abs/2312.07384
This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set. Self-training aims to provide supplementary supervision for the training process by generating pseudo-labels (action proposals) from a base model. However, most current methods generate action proposals by applying manually designed thresholds to action classification probabilities and treating adjacent snippets as independent entities. As a result, these methods struggle to generate complete action proposals, exhibit sensitivity to fluctuations in action classification scores, and generate redundant and overlapping action proposals. This paper proposes a novel framework termed ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised action Localization. ADM-Loc generates action proposals by fitting a composite distribution, comprising both Gaussian and uniform distributions, to the action classification signals. This fitting process is tailored to each action class present in the video and is applied separately for each action instance, ensuring the distinctiveness of their distributions. ADM-Loc significantly enhances the alignment between the generated action proposals and ground-truth action instances and offers high-quality pseudo-labels for self-training. Moreover, to model action boundary snippets, it enforces consistency in action classification scores during training by employing Gaussian kernels, supervised with the proposed loss functions. ADM-Loc outperforms the state-of-the-art point-supervised methods on THUMOS14 and ActivityNet-v1.2 datasets.
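A hedged sketch of fitting a Gaussian-plus-uniform composite to a one-dimensional class-score signal with `scipy.optimize.curve_fit`; the initialization and the two-sigma proposal rule are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_plus_uniform(t, mu, sigma, a, b):
    """Composite model: a Gaussian bump over a uniform floor."""
    return a * np.exp(-0.5 * ((t - mu) / sigma) ** 2) + b

def proposal_from_scores(scores, point_idx):
    # Fit the composite shape to one instance's 1-D class-score signal;
    # the fitted (mu, sigma) yield a proposal around the annotated point.
    t = np.arange(len(scores), dtype=np.float64)
    p0 = [float(point_idx), 5.0, scores.max(), scores.min()]  # initial guess
    popt, _ = curve_fit(gaussian_plus_uniform, t, scores, p0=p0)
    mu, sigma = popt[0], abs(popt[1])
    return max(0.0, mu - 2 * sigma), min(len(scores) - 1.0, mu + 2 * sigma)

scores = np.exp(-0.5 * ((np.arange(100) - 40) / 6.0) ** 2) + 0.05
start, end = proposal_from_scores(scores, point_idx=38)  # roughly (28, 52)
```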
https://arxiv.org/abs/2311.15916
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at this https URL.
https://arxiv.org/abs/2311.07562
This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of 'pseudo-labels' to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets, showing a significant improvement of 5% average mAP on the former.
https://arxiv.org/abs/2310.13585
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in the training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapping short snippets, and action classification is performed independently on every short snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to generate dense pseudo-labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of the pseudo-labels. Experiments show that our proposed method achieves competitive or superior performance to state-of-the-art methods and some fully-supervised methods on four benchmarks: the ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
https://arxiv.org/abs/2310.05511
Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in the integration of audio features into such frameworks. This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile, making it compatible with existing FPN TAL architectures and offering a significant enhancement in performance when audio data is available.
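A minimal PyTorch sketch of gated audio-to-visual cross-attention at a single pyramid level; dimensions, head count, and the gating form are illustrative assumptions rather than MRAV-FF's exact module.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Visual features attend to audio features; a learned sigmoid gate then
    decides how much attended audio to inject, so uninformative audio can be
    suppressed. One such block would be applied per FPN level."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: (B, Tv, D) features at this temporal resolution
        # audio:  (B, Ta, D) audio features (already projected to D)
        attended, _ = self.attn(query=visual, key=audio, value=audio)
        g = self.gate(torch.cat([visual, attended], dim=-1))  # (B, Tv, D)
        return self.norm(visual + g * attended)

fuse = GatedCrossAttentionFusion()
out = fuse(torch.randn(2, 128, 256), torch.randn(2, 64, 256))
```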
https://arxiv.org/abs/2310.03456
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
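As a rough illustration, the sketch below implements an InfoNCE-style contrastive loss where mined background frames serve as hard negatives; the temperature and mining policy are assumptions, not BAPG's reported settings.

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(anchor, positives, hard_negatives, tau=0.07):
    # anchor:         (D,)   embedding of an action frame near a boundary.
    # positives:      (P, D) other frames of the same action instance.
    # hard_negatives: (N, D) background frames that resemble the action in
    #                 pixels but differ in semantics.
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positives, dim=-1)
    neg = F.normalize(hard_negatives, dim=-1)
    pos_sim = (pos @ a) / tau          # (P,)
    neg_sim = (neg @ a) / tau          # (N,)
    # Contrast each positive against all hard negatives (positive at index 0).
    logits = torch.cat([pos_sim.unsqueeze(1),
                        neg_sim.expand(len(pos_sim), -1)], dim=1)
    labels = torch.zeros(len(pos_sim), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```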
https://arxiv.org/abs/2309.13810
Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.
https://arxiv.org/abs/2309.12067
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
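A simplified, self-contained sketch of KTS-style segmentation: dynamic programming over a linear-kernel Gram matrix that picks change points minimizing within-segment scatter; the original KTS additionally penalizes the number of segments to choose it automatically.

```python
import numpy as np

def kts_change_points(features, max_segments):
    # Dynamic programming over a linear-kernel Gram matrix: pick change
    # points that minimize total within-segment scatter. Fine for short
    # sequences; real KTS uses cumulative sums to speed up segment costs.
    T = len(features)
    K = features @ features.T
    cost = np.full((T + 1, T + 1), np.inf)
    for i in range(T):
        for j in range(i + 1, T + 1):
            block = K[i:j, i:j]
            cost[i, j] = np.trace(block) - block.sum() / (j - i)
    dp = np.full((max_segments + 1, T + 1), np.inf)
    back = np.zeros((max_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for m in range(1, max_segments + 1):
        for j in range(1, T + 1):
            for i in range(j):
                c = dp[m - 1, i] + cost[i, j]
                if c < dp[m, j]:
                    dp[m, j], back[m, j] = c, i
    cps, j = [], T
    for m in range(max_segments, 0, -1):
        j = back[m, j]
        cps.append(j)
    return sorted(cps)[1:]  # drop the leading 0; these are the boundaries

feats = np.concatenate([np.random.randn(30, 16), np.random.randn(40, 16) + 3.0])
print(kts_change_points(feats, max_segments=2))  # change point near frame 30
```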
https://arxiv.org/abs/2309.11569
Point-level weakly-supervised temporal action localization (PWTAL) aims to localize actions with only a single timestamp annotation for each action instance. Existing methods tend to mine dense pseudo labels to alleviate the label sparsity, but overlook the potential sub-action temporal structures, resulting in inferior performance. To tackle this problem, we propose a novel sub-action prototype learning framework (SPL-Loc) which comprises Sub-action Prototype Clustering (SPC) and Ordered Prototype Alignment (OPA). SPC adaptively extracts representative sub-action prototypes which are capable of perceiving the temporal scale and spatial content variation of action instances. OPA selects relevant prototypes to provide a completeness clue for pseudo label generation by applying a temporal alignment loss. As a result, pseudo labels are derived from the alignment results to improve action boundary prediction. Extensive experiments on three popular benchmarks demonstrate that the proposed SPL-Loc significantly outperforms existing SOTA PWTAL methods.
https://arxiv.org/abs/2309.09060
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e., instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate spatial-temporal context for classification, we design a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
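A minimal sketch of distribution-based boundary estimation in the spirit of the Trident-head: each instant predicts a probability distribution over relative-offset bins and takes its expectation; bin count and size are illustrative.

```python
import torch

def expected_boundary_offset(bin_logits, bin_size=1.0):
    # bin_logits: (T, B) per-instant logits over B relative-offset bins to,
    # e.g., the start boundary. Instead of regressing a single scalar, the
    # head predicts a distribution over neighboring bins and takes its
    # expectation, which is smoother under noisy, unclear boundaries.
    probs = torch.softmax(bin_logits, dim=-1)                    # (T, B)
    offsets = torch.arange(bin_logits.size(-1), dtype=probs.dtype) * bin_size
    return (probs * offsets).sum(dim=-1)                         # (T,) expected offset

# Each instant predicts its distance to the action start as a distribution:
logits = torch.randn(100, 16)
start_offsets = expected_boundary_offset(logits, bin_size=0.5)
```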
https://arxiv.org/abs/2309.05590
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture.
https://arxiv.org/abs/2308.13082
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g., appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning a more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic understanding of action patterns, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
https://arxiv.org/abs/2308.12609