In this study, we introduce DeepLocalization, a framework devised for real-time localization of actions, tailored explicitly to monitoring driver behavior. Leveraging advanced deep learning methods, we aim to tackle the critical issue of distracted driving, a significant contributor to road accidents. Our strategy employs a dual approach: Graph-Based Change-Point Detection for pinpointing actions in time, alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to handle the nuances of driving activities, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it broadly applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance, achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.
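The abstract describes a two-stage pipeline: temporal change-point detection followed by Video-LLM classification. The sketch below is not the authors' implementation; the `ruptures` kernel change-point detector stands in for the graph-based detector, and `classify_with_video_llm` is a hypothetical callable wrapping whatever Video-LLM and prompt are actually used.

```python
import numpy as np
import ruptures as rpt  # off-the-shelf change-point detection library

def localize_and_classify(frame_features: np.ndarray, fps: float, classify_with_video_llm):
    """frame_features: (T, D) per-frame embeddings from any visual backbone;
    classify_with_video_llm: hypothetical callable wrapping the prompted Video-LLM."""
    # 1) Temporal localization: detect change points in the feature sequence
    #    (kernel detector used here as a stand-in for the graph-based detector).
    detector = rpt.KernelCPD(kernel="rbf").fit(frame_features)
    boundaries = [0] + detector.predict(pen=10.0)  # frame indices of segment ends

    # 2) Classification: prompt the Video-LLM once per detected segment.
    prompt = ("You are monitoring a driver. Which activity best describes this clip? "
              "Answer with a single label such as 'texting', 'drinking' or 'normal driving'.")
    events = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        label = classify_with_video_llm(segment=(start, end), prompt=prompt)
        events.append({"start_s": start / fps, "end_s": end / fps, "label": label})
    return events
```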
https://arxiv.org/abs/2404.12258
Autism Spectrum Disorder (ASD) presents significant challenges in early diagnosis and intervention, impacting children and their families. With prevalence rates rising, there is a critical need for accessible and efficient screening tools. Leveraging machine learning (ML) techniques, in particular Temporal Action Localization (TAL), holds promise for automating ASD screening. This paper introduces a self-attention based TAL model designed to identify ASD-related behaviors in infant videos. Unlike existing methods, our approach simplifies complex modeling and emphasizes efficiency, which is essential for practical deployment in real-world scenarios. Importantly, this work underscores the importance of developing computer vision methods capable of operating in naturalistic environments with little equipment control, addressing key challenges in ASD screening. This study is the first to conduct end-to-end temporal action localization in untrimmed videos of infants with ASD, offering promising avenues for early intervention and support. We report baseline results of behavior detection using our TAL model, achieving 70% accuracy for "look face", 79% for "look object", 72% for "smile", and 65% for "vocalization".
https://arxiv.org/abs/2404.05849
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions unseen during training in untrimmed videos. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias in the learned model, which may adversely affect its ability to generalize to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically different perspective, relaxing the requirement for training data. To this end, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM) and operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed with a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed to refine the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
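A minimal sketch of the three T3AL steps on top of a frozen vision-language model, assuming `frame_feats` (T x D) and `class_text_feats` (C x D) are L2-normalized outputs of the VLM's image and text encoders; the caption-based refinement of step three is left as a placeholder because it depends on an external captioning model.

```python
import torch

def t3al_like_inference(frame_feats, class_text_feats, sim_thresh=0.0):
    """frame_feats: (T, D) and class_text_feats: (C, D), both L2-normalized VLM embeddings."""
    # Step 1: video-level pseudo-label by aggregating frame-text similarity over the whole video.
    sims = frame_feats @ class_text_feats.T           # (T, C) cosine similarities
    video_label = sims.mean(dim=0).argmax().item()

    # Step 2: coarse localization -- keep frames whose centered score for the pseudo-label
    # is above the threshold, then merge consecutive kept frames into proposals.
    scores = sims[:, video_label]
    keep = (scores - scores.mean()) > sim_thresh
    proposals, start = [], None
    for t, k in enumerate(keep.tolist() + [False]):
        if k and start is None:
            start = t
        elif not k and start is not None:
            proposals.append((start, t - 1))
            start = None

    # Step 3 (placeholder): refine proposal boundaries using frame-level captions from an
    # off-the-shelf captioning model, as described in the abstract.
    return video_label, proposals
```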
https://arxiv.org/abs/2404.05426
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.
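The language-aware classifier can be pictured as scoring unified audio-visual snippet features against prompt embeddings from a frozen text encoder, so new event types are added at inference time simply by changing the prompts. In this sketch, `encode_text` is a stand-in for whatever pre-trained text encoder UniAV actually uses.

```python
import torch.nn.functional as F

def language_aware_logits(snippet_feats, prompts, encode_text, temperature=0.07):
    """snippet_feats: (T, D) unified audio-visual snippet features;
    encode_text: stand-in for a frozen pre-trained text encoder returning (C, D)."""
    text_feats = encode_text(prompts)
    v = F.normalize(snippet_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return (v @ t.T) / temperature                    # (T, C) per-snippet class logits

# Prompts can mix TAL, SED and AVEL classes, including previously unseen ones, e.g.:
# logits = language_aware_logits(feats, ["a person playing violin", "a dog barking"], encode_text)
```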
https://arxiv.org/abs/2404.03179
In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset, including the requirement for precise action localization, the presence of a long-tail data distribution, non-visibility in certain actions, and inherent label noise. To do so, ASTRA incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and to produce precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture the label variability, and (d) input audio signal to enhance detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, achieving a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secure the 3rd position with an Average-mAP of 70.21 on the challenge set.
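A hedged sketch of what a balanced mixup step for a long-tailed action-spotting batch could look like: one sample is drawn uniformly, its partner from an inverse-frequency (class-balanced) distribution, and features and labels are mixed with a Beta-distributed coefficient. The exact sampling and weighting in ASTRA may differ.

```python
import torch

def balanced_mixup(feats, labels, class_counts, alpha=0.2):
    """feats: (B, D); labels: (B, C) multi-hot floats; class_counts: (C,) training-set frequencies."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    # Class-balanced partner sampling: weight each sample by the rarity of its classes.
    inv_freq = 1.0 / class_counts.clamp(min=1).float()
    sample_w = (labels * inv_freq).sum(dim=1) + 1e-6
    partner = torch.multinomial(sample_w, feats.size(0), replacement=True)
    mixed_feats = lam * feats + (1 - lam) * feats[partner]
    mixed_labels = lam * labels + (1 - lam) * labels[partner]
    return mixed_feats, mixed_labels
```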
https://arxiv.org/abs/2404.01891
Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2~(ViT-g) and leveraging them beyond head-only transfer learning.
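A simplified sketch of the long-short-range adapter idea: lightweight adapters consume intermediate features of the frozen backbone at different temporal ranges (implemented here as depthwise temporal convolutions with small versus large kernels), run outside the backbone's computation graph, and are fused before the TAL head. This illustrates the concept rather than the LoSA architecture itself.

```python
import torch
import torch.nn as nn

class RangeAdapter(nn.Module):
    """Lightweight adapter over one temporal range; use a small (odd) kernel for short range,
    a large (odd) kernel for long range."""
    def __init__(self, dim, kernel):
        super().__init__()
        self.down = nn.Linear(dim, dim // 4)
        self.temporal = nn.Conv1d(dim // 4, dim // 4, kernel, padding=kernel // 2, groups=dim // 4)
        self.up = nn.Linear(dim // 4, dim)

    def forward(self, x):                              # x: (B, T, D) intermediate backbone features
        h = self.down(x).transpose(1, 2)               # (B, D/4, T)
        h = torch.relu(self.temporal(h)).transpose(1, 2)
        return self.up(h)

def losa_like_features(intermediate_feats, short_adapters, long_adapters, fusion):
    """intermediate_feats: list of (B, T, D) tensors from frozen backbone layers."""
    fused = []
    for feats, short, long_ in zip(intermediate_feats, short_adapters, long_adapters):
        feats = feats.detach()                         # backbone stays frozen; only adapters train
        fused.append(short(feats) + long_(feats))
    # Simple stand-in for Long-Short-range Fusion before the TAL head.
    return fusion(torch.stack(fused, dim=0).mean(dim=0))
```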
https://arxiv.org/abs/2404.01282
This paper introduces a novel approach to temporal action localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods that often lead to overfitting due to the inability to generalize across varying contexts in real-world videos. Recognizing the diversity of camera views, backgrounds, and objects in videos, we propose a multi-prompt learning framework enhanced with optimal transport. This design allows the model to learn a set of diverse prompts for each action, capturing general characteristics more effectively and distributing the representation to mitigate the risk of overfitting. Furthermore, by employing optimal transport theory, we efficiently align these prompts with action features, optimizing for a comprehensive representation that adapts to the multifaceted nature of video data. Our experiments demonstrate significant improvements in action localization accuracy and robustness in few-shot settings on the standard challenging datasets of THUMOS-14 and EpicKitchens100, highlighting the efficacy of our multi-prompt optimal transport approach in overcoming the challenges of conventional few-shot TAL methods.
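A minimal sketch of aligning a set of learned prompts to frame features with entropic optimal transport (Sinkhorn iterations). The cost choice and the way the transport plan enters the training objective are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=50, eps=0.05):
    """cost: (P, T) between P prompts and T frames; returns a (P, T) transport plan."""
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)        # uniform marginal over prompts
    v = torch.ones(cost.size(1)) / cost.size(1)        # uniform marginal over frames
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)

def prompt_action_alignment(prompt_feats, frame_feats):
    """prompt_feats: (P, D) learned prompts for one action; frame_feats: (T, D) video features."""
    cost = 1 - F.normalize(prompt_feats, dim=-1) @ F.normalize(frame_feats, dim=-1).T
    plan = sinkhorn(cost)
    return (plan * (1 - cost)).sum()                   # higher = the prompt set matches this clip better
```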
https://arxiv.org/abs/2403.18915
The crux of semi-supervised temporal action localization (SS-TAL) lies in excavating valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating learning from non-target classes, transcending the conventional focus solely on the target class. The proposed approach involves partitioning the label space of the predicted class distribution into distinct subspaces: target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class while excluding ambiguous classes. To this end, we first devise innovative strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.
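A hedged sketch of the label-space partitioning and the accompanying positive/negative losses: classes are split into target, positive, negative, and ambiguous sets using confidence and rank relative to the target class, and simple losses pull predictions toward positive classes and away from negative ones. The thresholds and loss forms here are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def partition_label_space(probs, pos_ratio=0.7, neg_thresh=0.05):
    """probs: (C,) predicted class distribution for one unlabeled snippet."""
    order = probs.argsort(descending=True)
    target = order[0].item()
    positive = [c for c in order[1:].tolist() if probs[c] >= pos_ratio * probs[target]]
    negative = [c for c in order.tolist()
                if c != target and c not in positive and probs[c] <= neg_thresh]
    ambiguous = [c for c in range(probs.numel())
                 if c != target and c not in positive and c not in negative]
    return target, positive, negative, ambiguous

def non_target_losses(logits, positive, negative):
    log_p = F.log_softmax(logits, dim=-1)
    pos_loss = -log_p[positive].mean() if positive else logits.new_zeros(())
    neg_loss = -torch.log(1 - log_p[negative].exp() + 1e-6).mean() if negative else logits.new_zeros(())
    return pos_loss, neg_loss                          # push toward positives, away from negatives
```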
https://arxiv.org/abs/2403.11189
Compared with images, skeleton-based motion representations are robust for action localization and understanding, owing to their invariance to perspective, lighting, and occlusion. Yet they are often ambiguous and incomplete when taken out of context, even for human annotators. As infants discern gestures before associating them with words, actions can be conceptualized before being grounded with labels. Therefore, we propose the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), that partitions a skeleton-based motion sequence into discovered semantically meaningful pre-action segments. By fine-tuning our pre-training network with a small amount of annotated data, we show results that outperform SOTA methods by a large margin.
https://arxiv.org/abs/2403.07354
Temporal localization of driving actions plays a crucial role in advanced driver-assistance systems and naturalistic driving studies. However, this is a challenging task due to strict requirements for robustness, reliability and accurate localization. In this work, we focus on improving the overall performance by efficiently utilizing video action recognition networks and adapting these to the problem of action localization. To this end, we first develop a density-guided label smoothing technique based on label probability distributions to facilitate better learning from boundary video-segments that typically include multiple labels. Second, we design a post-processing step to efficiently fuse information from video-segments and multiple camera views into scene-level predictions, which facilitates elimination of false positives. Our methodology yields a competitive performance on the A2 test set of the naturalistic driving action recognition track of the 2022 NVIDIA AI City Challenge with an F1 score of 0.271.
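A minimal sketch of density-guided label smoothing for boundary video-segments: instead of a one-hot target, the segment's target distribution is built from how much of each action falls inside its time span, so segments straddling two actions receive soft targets. The exact smoothing used in the paper may differ.

```python
import numpy as np

def density_smoothed_target(segment, annotations, num_classes):
    """segment: (start_s, end_s); annotations: list of (start_s, end_s, class_id)."""
    seg_start, seg_end = segment
    target = np.zeros(num_classes, dtype=np.float32)
    for a_start, a_end, cls in annotations:
        overlap = max(0.0, min(seg_end, a_end) - max(seg_start, a_start))
        target[cls] += overlap / (seg_end - seg_start)  # density of this label within the segment
    s = target.sum()
    return target / s if s > 0 else target              # normalize into a soft label distribution

# e.g. a 4 s boundary segment covering 3 s of "phone call" and 1 s of "drinking"
# yields a [0.75, 0.25, ...] target instead of a one-hot label.
```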
https://arxiv.org/abs/2403.06616
This work explores the performance of a large video understanding foundation model on the downstream task of human fall detection in untrimmed video, leveraging a pretrained vision transformer for multi-class action detection with the classes "Fall", "Lying" and "Other/Activities of daily living (ADL)". A method for temporal action localization that relies on a simple cut-up of untrimmed videos is demonstrated. The methodology includes a preprocessing pipeline that converts datasets with timestamp action annotations into labeled datasets of short action clips. Simple and effective clip-sampling strategies are introduced. The effectiveness of the proposed method has been empirically evaluated on the publicly available High-Quality Fall Simulation Dataset (HQFSD). The experimental results validate the proposed pipeline: the results are promising for real-time application, and falls are detected at video level with a state-of-the-art 0.96 F1 score on the HQFSD dataset under the given experimental settings. The source code will be made available on GitHub.
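A small sketch of the preprocessing idea: an untrimmed video with timestamped annotations is cut into fixed-length clips, each labeled by the annotation that covers most of it and defaulting to the ADL class otherwise. Clip length, stride and the tie-breaking rule are assumptions of this sketch.

```python
def cut_into_labeled_clips(duration_s, annotations, clip_len=2.0, stride=1.0, default="Other/ADL"):
    """annotations: list of (start_s, end_s, label) with labels such as 'Fall' or 'Lying'."""
    clips, t = [], 0.0
    while t + clip_len <= duration_s:
        best_label, best_overlap = default, 0.0
        for a_start, a_end, label in annotations:
            overlap = max(0.0, min(t + clip_len, a_end) - max(t, a_start))
            if overlap > best_overlap:                  # label the clip by its dominant annotation
                best_label, best_overlap = label, overlap
        clips.append({"start": t, "end": t + clip_len, "label": best_label})
        t += stride
    return clips
```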
https://arxiv.org/abs/2401.16280
Action Localization is a challenging problem that combines detection and recognition tasks, which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding-box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand, single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload, compromising performance for speed. These methods build on adding a DETR head with learnable queries that, after cross- and self-attention, can be sent to corresponding MLPs for detecting a person's bounding box and action. However, DETR-like architectures are challenging to train and can incur significant complexity. In this paper, we observe that a straightforward bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can perform both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViT-S architecture trained with bipartite matching to perform both tasks surpasses the same MViT-S trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline, our MViTv2-S model achieves +3 mAP on AVA2.2 with respect to the two-stage counterpart. Code and models will be released after paper revision.
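A hedged sketch of applying a bipartite matching loss directly to the output tokens of a vision transformer: each token predicts a box and an action distribution through small MLPs, tokens are matched one-to-one to ground-truth persons with the Hungarian algorithm, and the loss is computed on the matched pairs. The cost weights and the handling of unmatched tokens are illustrative, not the paper's exact design.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def bipartite_matching_loss(pred_boxes, pred_logits, gt_boxes, gt_actions, box_weight=5.0):
    """pred_boxes: (N, 4), pred_logits: (N, C) from per-token MLPs; gt_boxes: (M, 4); gt_actions: (M,)."""
    prob = pred_logits.softmax(-1)                          # (N, C)
    cost_cls = -prob[:, gt_actions]                         # (N, M) classification cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)       # (N, M) L1 box cost
    cost = (cost_cls + box_weight * cost_box).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)                # one output token per ground-truth person
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    loss_cls = F.cross_entropy(pred_logits[rows], gt_actions[cols])
    loss_box = F.l1_loss(pred_boxes[rows], gt_boxes[cols])
    return loss_cls + box_weight * loss_box
```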
https://arxiv.org/abs/2312.17686
Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at this https URL
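The core structure is cluster-then-classify: snippets are grouped into latent clusters, and each cluster is labeled foreground or background, yielding snippet-level F&B pseudo-labels. The paper obtains these assignments with an optimal-transport self-labeling mechanism; in the sketch below a k-means plus score-averaging heuristic merely stands in for it.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_fb_pseudolabels(snippet_feats, actionness, n_clusters=8, fg_ratio=0.5):
    """snippet_feats: (T, D); actionness: (T,) class-agnostic foreground scores in [0, 1]."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(snippet_feats)
    cluster_scores = np.array([actionness[clusters == k].mean() for k in range(n_clusters)])
    fg_clusters = set(np.argsort(-cluster_scores)[: int(np.ceil(fg_ratio * n_clusters))])
    return np.array([1 if c in fg_clusters else 0 for c in clusters])   # 1 = foreground snippet
```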
https://arxiv.org/abs/2312.14138
Temporal Action Localization (TAL) is a complex task that poses relevant challenges, particularly when attempting to generalize to new, unseen domains in real-world applications. These scenarios, despite being realistic, are often neglected in the literature, exposing existing solutions to significant performance degradation. In this work, we tackle this issue by introducing, for the first time, an approach for Unsupervised Domain Adaptation (UDA) in sparse TAL, which we refer to as Semantic Adversarial unsupervised Domain Adaptation (SADA). Our contribution is threefold: (1) we pioneer the development of a domain adaptation model that operates on realistic sparse action detection benchmarks; (2) we tackle the limitations of global-distribution alignment techniques by introducing a novel adversarial loss that is sensitive to local class distributions, ensuring finer-grained adaptation; and (3) we present a novel experimental setup, based on EpicKitchens100, that evaluates multiple types of domain shifts in a comprehensive manner. Our experimental results indicate that SADA improves adaptation across domains when compared to fully supervised state-of-the-art and alternative UDA methods, attaining a relative performance boost of up to 14%.
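A rough sketch of a class-conditional adversarial alignment loss: features pass through a gradient-reversal layer and per-class domain discriminators, so alignment is driven by local class distributions rather than one global distribution. The discriminator design and how classes are assigned to target-domain snippets are assumptions of this sketch, not SADA's exact loss.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad                                    # reverse gradients for the feature extractor

class ClassWiseDomainLoss(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.discriminators = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1)) for _ in range(num_classes))
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, feats, class_ids, domain_labels):
        """feats: (N, D); class_ids: (N,) class of each snippet; domain_labels: (N,) 0=source, 1=target."""
        feats = GradReverse.apply(feats)
        classes = class_ids.unique()
        loss = feats.new_zeros(())
        for c in classes:                               # align each class distribution separately
            mask = class_ids == c
            logits = self.discriminators[int(c)](feats[mask]).squeeze(-1)
            loss = loss + self.bce(logits, domain_labels[mask].float())
        return loss / max(len(classes), 1)
```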
https://arxiv.org/abs/2312.13377
Recently, temporal action localization (TAL) has garnered significant interest in the information retrieval community. However, existing supervised/weakly supervised methods are heavily dependent on extensive labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to obtain. Although some unsupervised methods have utilized the "iterative clustering and localization" paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudo-labels for model training. To address these limitations, we present a novel self-paced incremental learning model to enhance clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve clustering confidence by exploring contextual feature-robust visual information. Thereafter, we design two (constant- and variable-speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of the video pseudo-labels and further improving overall localization performance. Extensive experiments on two public datasets have substantiated the superiority of our model over several state-of-the-art competitors.
https://arxiv.org/abs/2312.07384
This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set. Self-training aims to provide supplementary supervision for the training process by generating pseudo-labels (action proposals) from a base model. However, most current methods generate action proposals by applying manually designed thresholds to action classification probabilities and treating adjacent snippets as independent entities. As a result, these methods struggle to generate complete action proposals, exhibit sensitivity to fluctuations in action classification scores, and generate redundant and overlapping action proposals. This paper proposes a novel framework termed ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised action Localization. ADM-Loc generates action proposals by fitting a composite distribution, comprising both Gaussian and uniform distributions, to the action classification signals. This fitting process is tailored to each action class present in the video and is applied separately for each action instance, ensuring the distinctiveness of their distributions. ADM-Loc significantly enhances the alignment between the generated action proposals and ground-truth action instances and offers high-quality pseudo-labels for self-training. Moreover, to model action boundary snippets, it enforces consistency in action classification scores during training by employing Gaussian kernels, supervised with the proposed loss functions. ADM-Loc outperforms the state-of-the-art point-supervised methods on THUMOS14 and ActivityNet-v1.2 datasets.
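A minimal sketch of the actionness-distribution idea: the per-snippet classification signal of one action instance is fit with a Gaussian component plus a uniform floor, and the fitted Gaussian defines the proposal boundaries. The fitting routine and the two-sigma boundary rule are illustrative choices, not necessarily those of ADM-Loc.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_plus_uniform(t, amp, mu, sigma, floor):
    return amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2) + floor

def fit_action_proposal(scores, point_idx):
    """scores: (T,) classification signal for one action instance; point_idx: its annotated frame."""
    t = np.arange(len(scores), dtype=np.float32)
    p0 = [float(scores.max()), float(point_idx), 5.0, float(scores.min())]  # init Gaussian at the point
    (amp, mu, sigma, floor), _ = curve_fit(gaussian_plus_uniform, t, scores, p0=p0, maxfev=5000)
    start, end = mu - 2 * abs(sigma), mu + 2 * abs(sigma)                   # proposal from the fitted Gaussian
    return max(0.0, start), min(float(len(scores) - 1), end)
```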
https://arxiv.org/abs/2311.15916
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as a human user would, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at this https URL.
https://arxiv.org/abs/2311.07562
This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of "pseudo-labels" to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets, showing a significant improvement of 5% average mAP on the former.
https://arxiv.org/abs/2310.13585
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in the training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapping short snippets and action classification is performed independently on every snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to generate dense pseudo-labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of the pseudo-labels. Experiments show that our proposed method achieves competitive or superior performance to state-of-the-art methods and some fully-supervised methods on four benchmarks: the ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
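A small sketch of evaluating flexible-duration proposals around a point annotation instead of classifying snippets independently: candidate windows of several lengths are centered on the annotated frame and ranked by an inner-versus-outer score contrast, so a whole instance is scored at once. The window sizes and the contrast rule are illustrative assumptions.

```python
import numpy as np

def best_proposal_around_point(scores, point_idx, lengths=(8, 16, 32, 64), margin=0.25):
    """scores: (T,) class scores for the annotated action; point_idx: the annotated frame index."""
    T, best = len(scores), None
    for L in lengths:                                   # candidate proposals of flexible duration
        start = max(0, point_idx - L // 2)
        end = min(T, start + L)
        inner = scores[start:end].mean()
        ctx = int(margin * L)
        outer = np.concatenate([scores[max(0, start - ctx):start], scores[end:end + ctx]])
        contrast = inner - (outer.mean() if outer.size else 0.0)
        if best is None or contrast > best[0]:
            best = (contrast, start, end)
    return best[1], best[2]                             # highest-contrast (start, end) window
```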
https://arxiv.org/abs/2310.05511
Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in the integration of audio features into such frameworks. This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile, making it compatible with existing FPN TAL architectures and offering a significant enhancement in performance when audio data is available.
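A condensed sketch of gated cross-attention fusion at a single pyramid level: visual features attend to audio features, and a learned sigmoid gate decides how much attended audio to inject, so fusion can be suppressed when audio is uninformative. The full method applies this hierarchically across temporal resolutions; only one level is shown here, and the module details are assumptions rather than the MRAV-FF design.

```python
import torch
import torch.nn as nn

class GatedAudioVisualFusion(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual, audio):
        """visual: (B, Tv, D) features at one pyramid level; audio: (B, Ta, D) audio features."""
        attended, _ = self.cross_attn(query=visual, key=audio, value=audio)
        gate = self.gate(torch.cat([visual, attended], dim=-1))   # (B, Tv, D) values in [0, 1]
        return visual + gate * attended                            # gated residual injection of audio
```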
https://arxiv.org/abs/2310.03456