This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set. Self-training aims to provide supplementary supervision for the training process by generating pseudo-labels (action proposals) from a base model. However, most current methods generate action proposals by applying manually designed thresholds to action classification probabilities and treating adjacent snippets as independent entities. As a result, these methods struggle to generate complete action proposals, exhibit sensitivity to fluctuations in action classification scores, and generate redundant and overlapping action proposals. This paper proposes a novel framework termed ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised action Localization. ADM-Loc generates action proposals by fitting a composite distribution, comprising both Gaussian and uniform distributions, to the action classification signals. This fitting process is tailored to each action class present in the video and is applied separately for each action instance, ensuring the distinctiveness of their distributions. ADM-Loc significantly enhances the alignment between the generated action proposals and ground-truth action instances and offers high-quality pseudo-labels for self-training. Moreover, to model action boundary snippets, it enforces consistency in action classification scores during training by employing Gaussian kernels, supervised with the proposed loss functions. ADM-Loc outperforms the state-of-the-art point-supervised methods on THUMOS14 and ActivityNet-v1.2 datasets.
https://arxiv.org/abs/2311.15916
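To make the composite-distribution idea above concrete, here is a minimal numpy sketch (not the authors' code): it fits a Gaussian to the class scores around one annotated frame by weighted moments, mixes it with a uniform background floor, and keeps the contiguous span whose Gaussian responsibility stays high. The function name `fit_proposal` and all thresholds are our own illustrative choices.

```python
import numpy as np

def fit_proposal(scores, point_idx, bg_floor=0.05, resp_thresh=0.5):
    """Toy actionness-distribution modelling around one annotated frame:
    fit a Gaussian to the class scores by weighted moments, mix it with a
    uniform background floor, and keep the contiguous span around the point
    whose Gaussian responsibility exceeds `resp_thresh`."""
    t = np.arange(len(scores), dtype=float)
    w = np.clip(scores, 1e-6, None)
    mu = np.sum(w * t) / np.sum(w)                  # weighted mean (centre)
    sigma = np.sqrt(np.sum(w * (t - mu) ** 2) / np.sum(w)) + 1e-6
    gauss = np.exp(-0.5 * ((t - mu) / sigma) ** 2)  # unnormalised Gaussian
    resp = gauss / (gauss + bg_floor)               # responsibility vs. uniform floor
    # expand left/right from the annotated point while responsibility stays high
    s = e = point_idx
    while s > 0 and resp[s - 1] >= resp_thresh:
        s -= 1
    while e < len(scores) - 1 and resp[e + 1] >= resp_thresh:
        e += 1
    return s, e

# toy usage: a bump of classification scores around the annotated frame 12
scores = np.exp(-0.5 * ((np.arange(40) - 12) / 4.0) ** 2) + 0.02
print(fit_proposal(scores, point_idx=12))
```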
We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at this https URL.
https://arxiv.org/abs/2311.07562
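The agent loop described above (screenshot in, reasoned and localized action out) can be pictured with the sketch below. The LMM call, the action schema, and the executor are hypothetical placeholders of our own, not the paper's system or any vendor's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # e.g. "tap", "type", "done"
    target_box: tuple   # (x1, y1, x2, y2) on the screenshot, if applicable
    text: str = ""

def query_lmm(screenshot_png: bytes, instruction: str, history: list) -> Action:
    """Hypothetical wrapper around a multimodal model such as GPT-4V: send the
    current screenshot plus the instruction and prior actions, and parse the
    reply into a structured Action.  Implementation intentionally omitted."""
    raise NotImplementedError

def run_episode(instruction: str, get_screenshot, execute, max_steps: int = 10):
    """Screen -> reasoning -> localized action loop, as sketched in the abstract."""
    history = []
    for _ in range(max_steps):
        action = query_lmm(get_screenshot(), instruction, history)
        if action.kind == "done":
            break
        execute(action)          # e.g. tap the centre of action.target_box
        history.append(action)
    return history
```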
This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of `pseudo-labels' to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets, showing a significant improvement of 5% average mAP on the former.
https://arxiv.org/abs/2310.13585
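As a rough illustration of how a base model can turn point supervision into coarse pseudo-label proposals before the refinement and regression described above, the sketch below groups contiguous snippets whose class score stays above a threshold around each annotated frame. This is a generic thresholding baseline, not POTLoc's refined proposals; the function name and threshold are ours.

```python
import numpy as np

def point_guided_proposals(class_scores, annotated_points, thresh=0.3):
    """For each annotated frame, return the contiguous run of snippets whose
    class score stays above `thresh` and that contains the point.  The paper's
    refined, regressed proposals replace this simple grouping."""
    above = class_scores >= thresh
    proposals = []
    for p in annotated_points:
        if not above[p]:
            proposals.append((p, p))        # fall back to the point itself
            continue
        s = e = p
        while s > 0 and above[s - 1]:
            s -= 1
        while e < len(class_scores) - 1 and above[e + 1]:
            e += 1
        proposals.append((s, e))
    return proposals

scores = np.array([0.1, 0.2, 0.6, 0.7, 0.8, 0.4, 0.1, 0.5, 0.9, 0.2])
print(point_guided_proposals(scores, annotated_points=[4, 8]))  # [(2, 5), (7, 8)]
```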
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in the training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapping short snippets, and action classification is performed independently on every short snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to generate dense pseudo labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of the pseudo labels. Experiments show that our proposed method achieves competitive or superior performance to state-of-the-art methods and some fully-supervised methods on four benchmarks: the ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
https://arxiv.org/abs/2310.05511
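One common way to evaluate flexible-duration proposals of the kind the abstract describes is an outer-inner contrast score: a proposal is good if its mean actionness clearly exceeds that of short flanking regions. The sketch below shows that generic scoring rule; it is not necessarily the paper's evaluation criterion, and the names and margin ratio are ours.

```python
import numpy as np

def outer_inner_score(scores, start, end, margin_ratio=0.25):
    """Score a candidate proposal [start, end] by how much its mean actionness
    exceeds that of short flanking regions (outer-inner contrast)."""
    length = end - start + 1
    m = max(1, int(round(margin_ratio * length)))
    inner = scores[start:end + 1].mean()
    left = scores[max(0, start - m):start]
    right = scores[end + 1:end + 1 + m]
    outer = np.concatenate([left, right])
    return inner - (outer.mean() if outer.size else 0.0)

scores = np.array([0.1, 0.1, 0.7, 0.8, 0.9, 0.8, 0.2, 0.1])
candidates = [(2, 5), (1, 6), (3, 4)]
print(max(candidates, key=lambda c: outer_inner_score(scores, *c)))  # picks (2, 5)
```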
Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in the integration of audio features into such frameworks. This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile, making it compatible with existing FPN TAL architectures and offering a significant enhancement in performance when audio data is available.
https://arxiv.org/abs/2310.03456
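A gated cross-attention fusion step of the kind described above can be sketched as follows: at one pyramid level, visual tokens attend to audio tokens and a learned sigmoid gate decides how much of the attended audio to inject back. This is our simplified reading of the mechanism, not the MRAV-FF implementation; applied per FPN level it approximates the hierarchical fusion the abstract names.

```python
import torch
import torch.nn as nn

class GatedAudioVisualFusion(nn.Module):
    """Gated cross-attention fusion at one temporal resolution (sketch)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, Tv, C) tokens at this pyramid level; audio: (B, Ta, C)
        attended, _ = self.attn(query=visual, key=audio, value=audio)
        g = self.gate(torch.cat([visual, attended], dim=-1))  # per-token gate in [0, 1]
        return self.norm(visual + g * attended)

fused = GatedAudioVisualFusion(dim=256)(torch.randn(2, 64, 256), torch.randn(2, 32, 256))
print(fused.shape)  # torch.Size([2, 64, 256])
```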
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
https://arxiv.org/abs/2309.13810
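The contrastive learning with hard negative mining mentioned above can be written as a standard InfoNCE-style loss in which the mined background frames (pixel-similar but semantically different) serve as the negatives. The sketch below shows that generic formulation; shapes, the temperature, and the function name are our assumptions, not BAPG's exact loss.

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(anchor, positive, hard_negatives, tau=0.1):
    """InfoNCE-style loss: pull an action-frame embedding towards another frame
    of the same action and push it away from mined hard-negative background
    frames.  anchor/positive: (B, C); hard_negatives: (B, K, C)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negatives = F.normalize(hard_negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True) / tau            # (B, 1)
    neg = torch.einsum("bc,bkc->bk", anchor, hard_negatives) / tau   # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)           # positive is index 0
    return F.cross_entropy(logits, labels)

loss = hard_negative_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128),
                                      torch.randn(8, 16, 128))
print(loss.item())
```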
Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.
https://arxiv.org/abs/2309.12067
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
https://arxiv.org/abs/2309.11569
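To illustrate the segmentation step the abstract builds on, here is a simplified Kernel Temporal Segmentation variant: dynamic programming that places change points to minimise total within-segment variance. The real KTS operates on a kernel Gram matrix and penalises the number of segments; this toy version fixes the segment count and uses Euclidean features.

```python
import numpy as np

def kts_segments(features, n_segments):
    """Simplified KTS: choose change points minimising within-segment scatter."""
    T = len(features)
    csum = np.cumsum(features, axis=0)                 # (T, D)
    csum_sq = np.cumsum((features ** 2).sum(axis=1))   # (T,)

    def cost(i, j):  # scatter of features[i..j] inclusive
        n = j - i + 1
        s = csum[j] - (csum[i - 1] if i > 0 else 0)
        sq = csum_sq[j] - (csum_sq[i - 1] if i > 0 else 0)
        return sq - (s @ s) / n

    dp = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, T + 1):                      # first j frames in k segments
            for i in range(k - 1, j):                  # last segment covers frames i..j-1
                c = dp[k - 1, i] + cost(i, j - 1)
                if c < dp[k, j]:
                    dp[k, j], back[k, j] = c, i
    bounds, j = [], T                                  # recover segment start indices
    for k in range(n_segments, 0, -1):
        j = back[k, j]
        bounds.append(j)
    return sorted(bounds)[1:]                          # drop the leading 0

feats = np.concatenate([np.zeros((20, 4)), np.ones((15, 4)), 2 * np.ones((25, 4))])
print(kts_segments(feats, n_segments=3))               # [20, 35]
```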
Point-level weakly-supervised temporal action localization (PWTAL) aims to localize actions with only a single timestamp annotation for each action instance. Existing methods tend to mine dense pseudo labels to alleviate the label sparsity, but overlook the potential sub-action temporal structures, resulting in inferior performance. To tackle this problem, we propose a novel sub-action prototype learning framework (SPL-Loc) which comprises Sub-action Prototype Clustering (SPC) and Ordered Prototype Alignment (OPA). SPC adaptively extracts representative sub-action prototypes that are capable of perceiving the temporal scale and spatial content variation of action instances. OPA selects relevant prototypes to provide a completeness cue for pseudo-label generation by applying a temporal alignment loss. As a result, pseudo labels are derived from the alignment results to improve action boundary prediction. Extensive experiments on three popular benchmarks demonstrate that the proposed SPL-Loc significantly outperforms existing SOTA PWTAL methods.
https://arxiv.org/abs/2309.09060
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e. instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate spatial-temporal context for classification, we design a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
https://arxiv.org/abs/2309.05590
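The boundary modelling above can be pictured as follows: instead of regressing a single offset per instant, predict logits over a small set of candidate offsets, softmax them into a relative probability distribution, and take its expectation. This sketch is in the spirit of a Trident-style head rather than TriDet's exact three-branch design; bin count and names are our choices.

```python
import torch
import torch.nn.functional as F

def expected_boundary_offset(bin_logits: torch.Tensor) -> torch.Tensor:
    """Boundary regression via an estimated relative probability distribution:
    softmax over B candidate offsets (in snippets), then take the expectation.
    bin_logits: (T, B) per-instant logits."""
    B = bin_logits.size(-1)
    probs = F.softmax(bin_logits, dim=-1)                # relative distribution
    offsets = torch.arange(B, dtype=probs.dtype)         # candidate offsets 0..B-1
    return (probs * offsets).sum(dim=-1)                 # (T,) expected offset

# toy usage: at each instant, predict start-offset and end-offset distributions
start_logits, end_logits = torch.randn(100, 16), torch.randn(100, 16)
t = torch.arange(100, dtype=torch.float32)
starts = t - expected_boundary_offset(start_logits)
ends = t + expected_boundary_offset(end_logits)
print(starts.shape, ends.shape)
```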
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture.
https://arxiv.org/abs/2308.13082
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
https://arxiv.org/abs/2308.12609
Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet level or the instance level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages: Snippet-level Discrimination Learning and Instance-level Completeness Learning; both stages explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class. We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representations. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries for the predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully supervised methods. Code will be available at this https URL.
https://arxiv.org/abs/2308.12608
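The online-updated memory of reliable snippet prototypes mentioned above can be sketched as a per-class exponential moving average that only high-confidence snippets are allowed to update. This is our reading of the idea, not HR-Pro's exact update rule; the class, threshold, and momentum values are illustrative.

```python
import torch
import torch.nn.functional as F

class PrototypeMemory:
    """Online memory of reliable snippet prototypes, one per class (sketch):
    snippets whose confidence exceeds `reliability` update their class
    prototype with an exponential moving average."""
    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9,
                 reliability: float = 0.7):
        self.protos = torch.zeros(num_classes, dim)
        self.momentum = momentum
        self.reliability = reliability

    @torch.no_grad()
    def update(self, feats: torch.Tensor, scores: torch.Tensor, labels: torch.Tensor):
        # feats: (N, C) snippet features, scores: (N,) confidence, labels: (N,) class ids
        for f, s, y in zip(feats, scores, labels):
            if s >= self.reliability:
                self.protos[y] = self.momentum * self.protos[y] + (1 - self.momentum) * f

    def similarity(self, feats: torch.Tensor) -> torch.Tensor:
        # cosine similarity of every snippet to every class prototype, (N, num_classes)
        return F.normalize(feats, dim=-1) @ F.normalize(self.protos, dim=-1).T

mem = PrototypeMemory(num_classes=20, dim=128)
mem.update(torch.randn(50, 128), torch.rand(50), torch.randint(0, 20, (50,)))
print(mem.similarity(torch.randn(5, 128)).shape)  # torch.Size([5, 20])
```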
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. MIL-based methods are relatively well studied and achieve cogent performance on classification, but not on localization. Generally, they locate temporal regions by video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-point detection module that detects change points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.
https://arxiv.org/abs/2308.09946
Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Due to the large scale of the datasets, most existing methods use a network pretrained on other datasets to extract features, which are not suitable enough for WTAL. To address this problem, researchers have designed several modules for feature enhancement that improve the performance of the localization module, especially by modeling the temporal relationships between snippets. However, all of them neglect the adverse effects of ambiguous information, which would reduce the discriminability of other snippets. Considering this phenomenon, we propose the Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose a feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Extensive experiments on the THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, establishing new state-of-the-art results on both datasets. Source code is available at this https URL.
https://arxiv.org/abs/2307.16415
This report describes our submission to the Ego4D Moment Queries Challenge 2023. Our submission extends ActionFormer, a recent method for temporal action localization. Our extension combines an improved ground-truth assignment strategy during training with a refined version of SoftNMS at inference time. Our solution is ranked 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from the 2023 challenge. Our code is available at this https URL.
https://arxiv.org/abs/2307.02025
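For reference, here is a generic 1-D Soft-NMS with Gaussian decay of the kind used at inference time for temporal proposals; it is a refresher on the standard algorithm, not the challenge entry's refined variant, and the parameter values are illustrative.

```python
import numpy as np

def temporal_soft_nms(segments, scores, sigma=0.5, score_thresh=0.001):
    """1-D Soft-NMS with Gaussian decay: instead of discarding proposals that
    overlap a higher-scoring one, decay their scores by exp(-tIoU^2 / sigma).
    `segments`: (N, 2) array of [start, end]; returns kept indices in selection order."""
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    idx = list(range(len(scores)))
    keep = []
    while idx:
        best = max(idx, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break
        keep.append(best)
        idx.remove(best)
        for i in idx:
            inter = max(0.0, min(segments[best, 1], segments[i, 1]) -
                             max(segments[best, 0], segments[i, 0]))
            union = (segments[best, 1] - segments[best, 0]) + \
                    (segments[i, 1] - segments[i, 0]) - inter
            tiou = inter / union if union > 0 else 0.0
            scores[i] *= np.exp(-(tiou ** 2) / sigma)   # soft suppression
    return keep

segs = [[0, 10], [1, 11], [20, 30]]
print(temporal_soft_nms(segs, [0.9, 0.8, 0.7]))   # all kept, with decayed scores
```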
The prevalence of violence in daily life poses significant threats to individuals' physical and mental well-being. Using surveillance cameras in public spaces has proven effective in proactively deterring and preventing such incidents. However, concerns regarding privacy invasion have emerged due to their widespread deployment. To address this problem, we leverage Dynamic Vision Sensor (DVS) cameras to detect violent incidents while preserving privacy, since they capture pixel brightness variations instead of static imagery. We introduce the Bullying10K dataset, encompassing various actions, complex movements, and occlusions from real-life scenarios. It provides three benchmarks for evaluating different tasks: action recognition, temporal action localization, and pose estimation. With 10,000 event segments, totaling 12 billion events and 255 GB of data, Bullying10K contributes significantly by balancing violence detection and personal privacy preservation, and it also poses a new challenge for neuromorphic datasets. It will serve as a valuable resource for training and developing privacy-protecting video systems. Bullying10K opens new possibilities for innovative approaches in these domains.
https://arxiv.org/abs/2306.11546
Video moment localization, also known as video moment retrieval, aims to search for a target segment within a video described by a given natural language query. Going beyond temporal action localization, where the target actions are pre-defined, video moment retrieval can query arbitrary complex activities. In this survey paper, we aim to present a comprehensive review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones. We also review the datasets available for video moment localization and group the results of related work. In addition, we discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.
https://arxiv.org/abs/2306.07515
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, where the predictions of segments are supervised by the labels of videos. However, the objective for acquiring segment-level scores during training is not consistent with the target for acquiring proposal-level scores during testing, leading to suboptimal results. To deal with this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages, which includes three key designs: 1) a surrounding contrastive feature extraction module to suppress the discriminative short proposals by considering the surrounding contrastive information, 2) a proposal completeness evaluation module to inhibit the low-quality proposals with the guidance of the completeness pseudo labels, and 3) an instance-level rank consistency loss to achieve robust detection by leveraging the complementarity of RGB and FLOW modalities. Extensive experimental results on two challenging benchmarks including THUMOS14 and ActivityNet demonstrate the superior performance of our method.
https://arxiv.org/abs/2305.17861
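One natural reading of the instance-level rank consistency loss mentioned above is to encourage the RGB and optical-flow streams to rank the same candidate proposals similarly, e.g. by matching their softmax score distributions with a symmetric KL divergence. The sketch below shows that formulation as an assumption of ours, not P-MIL's exact loss; the temperature and names are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_consistency_loss(rgb_scores: torch.Tensor, flow_scores: torch.Tensor,
                          tau: float = 0.2) -> torch.Tensor:
    """Match the proposal ranking of the two modalities via symmetric KL between
    their softmax-normalised scores.  rgb_scores/flow_scores: (N,) scores of the
    candidate proposals for one video and class."""
    p = F.log_softmax(rgb_scores / tau, dim=0)
    q = F.log_softmax(flow_scores / tau, dim=0)
    kl_pq = F.kl_div(q, p.exp(), reduction="sum")   # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="sum")   # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

print(rank_consistency_loss(torch.randn(12), torch.randn(12)).item())
```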
Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress offsets to boundaries, while overlooking the discrepant importance of each frame. In this paper, we propose an Action Sensitivity Learning framework (ASL) to tackle this task, which aims to assess the value of each frame and then leverage the generated action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively. The outputs of the two branches are combined to reweight the gradient of the two sub-tasks. Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames. The extensive studies on various action localization benchmarks (i.e., MultiThumos, Charades, Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show that ASL surpasses the state-of-the-art in terms of average-mAP under multiple types of scenarios, e.g., single-labeled, densely-labeled and egocentric.
https://arxiv.org/abs/2305.15701
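The recalibration idea above — weighting each frame's contribution by its estimated action sensitivity — can be sketched as a weighted per-frame classification loss. The evaluator that produces the weights is not shown, and this is our simplified illustration rather than ASL's full class- and instance-level scheme.

```python
import torch
import torch.nn.functional as F

def sensitivity_weighted_cls_loss(frame_logits: torch.Tensor,
                                  frame_labels: torch.Tensor,
                                  sensitivity: torch.Tensor) -> torch.Tensor:
    """Scale the per-frame classification loss by a sensitivity weight in [0, 1]
    so that frames judged more informative contribute more to the gradient.
    frame_logits: (T, num_classes), frame_labels: (T,), sensitivity: (T,)."""
    per_frame = F.cross_entropy(frame_logits, frame_labels, reduction="none")  # (T,)
    return (sensitivity * per_frame).sum() / sensitivity.sum().clamp(min=1e-6)

loss = sensitivity_weighted_cls_loss(torch.randn(50, 21),
                                     torch.randint(0, 21, (50,)),
                                     torch.rand(50))
print(loss.item())
```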