This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn only the most distinctive segments of actions, leading to incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently yields 'pseudo-labels' that serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer toward enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on the THUMOS'14 and ActivityNet-v1.2 datasets, showing a significant improvement of 5% average mAP on the former.
https://arxiv.org/abs/2310.13585
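The self-training loop above starts from snippet-level action scores and turns them into proposal pseudo-labels. A minimal sketch of that first step, assuming a simple score threshold (POTLoc's actual pipeline additionally refines and regresses these boundaries before using them as supervision):

```python
def proposals_from_scores(scores, threshold=0.5):
    """Group contiguous above-threshold snippets into (start, end) proposals.

    A simplified stand-in for pseudo-label generation from point-supervised
    snippet scores; `end` is exclusive.
    """
    proposals = []
    start = None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t                      # a proposal opens here
        elif s < threshold and start is not None:
            proposals.append((start, t))   # the proposal closes before t
            start = None
    if start is not None:                  # close a proposal running to the end
        proposals.append((start, len(scores)))
    return proposals
```

For example, snippet scores `[0.1, 0.8, 0.9, 0.2, 0.7]` at threshold 0.5 yield the proposals `[(1, 3), (4, 5)]`.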
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapping short snippets, and action classification is performed independently on every short snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to generate dense pseudo-labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of pseudo-labels. Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods and some fully-supervised methods on four benchmarks: the ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
https://arxiv.org/abs/2310.05511
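The MIL baseline this abstract argues against typically scores each snippet independently and aggregates the top-k snippet scores into one video-level class score. A minimal sketch of that standard aggregation (the exact pooling in prior MIL work varies):

```python
def video_score_topk(snippet_scores, k):
    """Video-level class score as the mean of the top-k snippet scores:
    the usual MIL aggregation, which rewards a few easy-to-distinguish
    snippets rather than coverage of the whole action instance."""
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / len(top)
```

With snippet scores `[0.9, 0.1, 0.8, 0.2]` and `k=2`, the video-level score is 0.85 even though half the snippets are near zero, which illustrates why this objective can ignore less distinctive parts of an action.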
Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in the integration of audio features into such frameworks. This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile, making it compatible with existing FPN TAL architectures and offering a significant enhancement in performance when audio data is available.
https://arxiv.org/abs/2310.03456
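The core of the approach is a gated cross-attention: each visual snippet attends over the audio stream, and the attended audio is admitted through a sigmoid gate before being fused. A toy single-head, unprojected sketch of that idea (the real MRAV-FF applies learned projections per pyramid level; the gate parameter here is an assumed scalar):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_cross_attention(visual, audio, gate):
    """For each visual snippet, attend over all audio snippets, then add
    the attended audio scaled by a sigmoid gate in (0, 1). With a very
    negative gate the fusion falls back to the visual stream alone."""
    g = 1.0 / (1.0 + math.exp(-gate))          # sigmoid gate
    fused = []
    for v in visual:
        weights = softmax([dot(v, a) for a in audio])
        attended = [sum(w * a[d] for w, a in zip(weights, audio))
                    for d in range(len(v))]
        fused.append([vd + g * ad for vd, ad in zip(v, attended)])
    return fused
```

The gate is what makes the fusion safe when audio is uninformative: the network can drive it toward zero per temporal scale, recovering the audio-free FPN behavior.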
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
https://arxiv.org/abs/2309.13810
Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.
https://arxiv.org/abs/2309.12067
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
https://arxiv.org/abs/2309.11569
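Kernel Temporal Segmentation chooses change points that minimize within-segment variation. A scalar stand-in, assuming a 1-D signal and a plain sum-of-squared-deviations cost solved by dynamic programming (KTS proper performs the analogous minimization on a kernel matrix over frame features):

```python
def kts_1d(x, n_segments):
    """Dynamic-programming change-point detection on a 1-D signal:
    split x into n_segments pieces minimizing total within-segment
    variance, returning the internal change-point indices."""
    n = len(x)
    ps = [0.0] * (n + 1)   # prefix sums
    ps2 = [0.0] * (n + 1)  # prefix sums of squares
    for i, v in enumerate(x):
        ps[i + 1] = ps[i] + v
        ps2[i + 1] = ps2[i] + v * v

    def cost(i, j):        # sum of squared deviations of x[i:j]
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    INF = float("inf")
    # dp[k][j]: best cost of splitting x[:j] into k segments
    dp = [[INF] * (n + 1) for _ in range(n_segments + 1)]
    cut = [[0] * (n + 1) for _ in range(n_segments + 1)]
    dp[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + cost(i, j)
                if c < dp[k][j]:
                    dp[k][j] = c
                    cut[k][j] = i
    bounds, j = [], n      # backtrack the change points
    for k in range(n_segments, 0, -1):
        j = cut[k][j]
        bounds.append(j)
    return sorted(bounds)[1:]  # drop the leading 0
```

On a signal with one clean jump, `[0, 0, 0, 5, 5, 5]` with two segments, the recovered change point is index 3; sampling one clip per recovered segment is then the adaptive alternative to uniform sampling described above.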
Point-level weakly-supervised temporal action localization (PWTAL) aims to localize actions with only a single timestamp annotation for each action instance. Existing methods tend to mine dense pseudo labels to alleviate the label sparsity, but overlook the potential sub-action temporal structures, resulting in inferior performance. To tackle this problem, we propose a novel sub-action prototype learning framework (SPL-Loc) which comprises Sub-action Prototype Clustering (SPC) and Ordered Prototype Alignment (OPA). SPC adaptively extracts representative sub-action prototypes which are capable to perceive the temporal scale and spatial content variation of action instances. OPA selects relevant prototypes to provide completeness clue for pseudo label generation by applying a temporal alignment loss. As a result, pseudo labels are derived from alignment results to improve action boundary prediction. Extensive experiments on three popular benchmarks demonstrate that the proposed SPL-Loc significantly outperforms existing SOTA PWTAL methods.
https://arxiv.org/abs/2309.09060
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e. instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate spatial-temporal context for classification, we design a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
https://arxiv.org/abs/2309.05590
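The Trident-head models a boundary not as a single regressed offset but as an estimated relative probability distribution around it, from which the offset is read off as an expectation. A minimal sketch of that readout, assuming bin values are simply 0..B-1 (the real head predicts such distributions for start and end around each instant):

```python
import math

def expected_offset(bin_logits):
    """Boundary offset as the expectation over a softmax-normalized
    distribution of discrete offset bins: a distributional boundary
    estimate in the spirit of the Trident-head."""
    m = max(bin_logits)
    probs = [math.exp(l - m) for l in bin_logits]
    z = sum(probs)
    return sum(b * p / z for b, p in enumerate(probs))
```

A flat distribution over two bins gives offset 0.5 (maximal uncertainty splits the difference), while a sharply peaked one collapses to its bin; this is what lets the head express boundary ambiguity instead of committing to one frame.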
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. However, gathering such data is challenging and computational resources might be limited. This work explores and measures how current deep temporal action localization models perform in settings constrained by the amount of data or computational power. We measure data efficiency by training each model on a subset of the training set. We find that TemporalMaxer outperforms other models in data-limited settings. Furthermore, we recommend TriDet when training time is limited. To test the efficiency of the models during inference, we pass videos of different lengths through each model. We find that TemporalMaxer requires the least computational resources, likely due to its simple architecture.
https://arxiv.org/abs/2308.13082
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
https://arxiv.org/abs/2308.12609
Point-supervised Temporal Action Localization (PSTAL) is an emerging research direction for label-efficient learning. However, current methods mainly focus on optimizing the network either at the snippet level or the instance level, neglecting the inherent reliability of point annotations at both levels. In this paper, we propose a Hierarchical Reliability Propagation (HR-Pro) framework, which consists of two reliability-aware stages, Snippet-level Discrimination Learning and Instance-level Completeness Learning, both of which explore the efficient propagation of high-confidence cues in point annotations. For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class. We then employ a Reliability-aware Attention Block to capture both intra-video and inter-video dependencies of snippets, resulting in more discriminative and robust snippet representations. For instance-level learning, we propose a point-based proposal generation approach as a means of connecting snippets and instances, which produces high-confidence proposals for further optimization at the instance level. Through multi-level reliability-aware learning, we obtain more reliable confidence scores and more accurate temporal boundaries of predicted proposals. Our HR-Pro achieves state-of-the-art performance on multiple challenging benchmarks, including an impressive average mAP of 60.3% on THUMOS14. Notably, our HR-Pro largely surpasses all previous point-supervised methods, and even outperforms several competitive fully supervised methods. Code will be available at this https URL.
https://arxiv.org/abs/2308.12608
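The online-updated memory above keeps one prototype per class and moves it toward newly observed reliable snippets. A minimal sketch, assuming an exponential moving average update (HR-Pro additionally filters snippets by confidence before updating):

```python
def update_prototype(prototype, snippet_feat, momentum=0.9):
    """Online prototype memory update: blend the stored per-class
    prototype with a reliable snippet feature via an exponential
    moving average, so the memory tracks the class slowly and robustly."""
    return [momentum * p + (1.0 - momentum) * f
            for p, f in zip(prototype, snippet_feat)]
```

With momentum 0.9, a single update moves a zero prototype only 10% of the way toward the new feature, which is what keeps a few noisy snippets from corrupting the stored class representation.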
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. MIL-based methods are relatively well studied and achieve strong performance on classification, but not on localization. Generally, they locate temporal regions via the video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-point detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.
https://arxiv.org/abs/2308.09946
Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Because of the scale of the datasets involved, most existing methods extract features with a network pretrained on other datasets; such features are not well suited to WTAL. To address this problem, researchers design several modules for feature enhancement, which improve the performance of the localization module, especially by modeling the temporal relationship between snippets. However, all of them neglect the adverse effects of ambiguous information, which reduces the discriminability of other snippets. Considering this phenomenon, we propose the Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose a feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Extensive experiments on the THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, establishing new state-of-the-art results on both datasets. Source code is available at \url{this https URL}.
https://arxiv.org/abs/2307.16415
This report describes our submission to the Ego4D Moment Queries Challenge 2023. Our submission extends ActionFormer, a recent state-of-the-art method for temporal action localization. Our extension combines an improved ground-truth assignment strategy during training and a refined version of SoftNMS at inference time. Our solution is ranked 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from the 2023 challenge. Our code is available at this https URL.
https://arxiv.org/abs/2307.02025
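SoftNMS, which the submission refines at inference time, suppresses overlapping proposals by decaying their scores rather than discarding them outright. A plain sketch of the standard Gaussian variant for 1-D temporal proposals (the submission's refined version is not public, so this shows only the baseline algorithm):

```python
import math

def temporal_iou(a, b):
    """tIoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(proposals, sigma=0.5, score_floor=0.001):
    """Gaussian Soft-NMS over (start, end, score) proposals: keep the
    highest-scored proposal, decay the scores of overlapping ones by
    exp(-tIoU^2 / sigma), drop anything falling below a floor, repeat."""
    props = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    while props:
        best = props.pop(0)
        kept.append(best)
        decayed = []
        for s, e, sc in props:
            iou = temporal_iou((best[0], best[1]), (s, e))
            sc *= math.exp(-(iou * iou) / sigma)
            if sc > score_floor:
                decayed.append((s, e, sc))
        props = sorted(decayed, key=lambda p: p[2], reverse=True)
    return kept
```

Unlike hard NMS, a heavily overlapping proposal survives with a reduced score, which matters for recall-oriented metrics such as the Recall@1x reported above.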
The prevalence of violence in daily life poses significant threats to individuals' physical and mental well-being. Using surveillance cameras in public spaces has proven effective in proactively deterring and preventing such incidents. However, concerns regarding privacy invasion have emerged due to their widespread deployment. To address the problem, we leverage Dynamic Vision Sensor (DVS) cameras to detect violent incidents and preserve privacy, since they capture pixel brightness variations instead of static imagery. We introduce the Bullying10K dataset, encompassing various actions, complex movements, and occlusions from real-life scenarios. It provides three benchmarks for evaluating different tasks: action recognition, temporal action localization, and pose estimation. With 10,000 event segments, totaling 12 billion events and 255 GB of data, Bullying10K contributes significantly by balancing violence detection and personal privacy preservation. It also poses a new challenge for neuromorphic datasets. It will serve as a valuable resource for training and developing privacy-protecting video systems. Bullying10K opens new possibilities for innovative approaches in these domains.
https://arxiv.org/abs/2306.11546
Video moment localization, also known as video moment retrieval, aims to search for a target segment within a video described by a given natural language query. Unlike temporal action localization, where the target actions are pre-defined, video moment retrieval can query arbitrarily complex activities. In this survey paper, we aim to present a comprehensive review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones. We also review the datasets available for video moment localization and the grouped results of related work. In addition, we discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.
https://arxiv.org/abs/2306.07515
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, where the predictions of segments are supervised by the labels of videos. However, the objective for acquiring segment-level scores during training is not consistent with the target for acquiring proposal-level scores during testing, leading to suboptimal results. To deal with this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages, which includes three key designs: 1) a surrounding contrastive feature extraction module to suppress the discriminative short proposals by considering the surrounding contrastive information, 2) a proposal completeness evaluation module to inhibit the low-quality proposals with the guidance of the completeness pseudo labels, and 3) an instance-level rank consistency loss to achieve robust detection by leveraging the complementarity of RGB and FLOW modalities. Extensive experimental results on two challenging benchmarks including THUMOS14 and ActivityNet demonstrate the superior performance of our method.
https://arxiv.org/abs/2305.17861
Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress offsets to boundaries, while overlooking the differing importance of each frame. In this paper, we propose an Action Sensitivity Learning framework (ASL) to tackle this task, which aims to assess the value of each frame and then leverage the generated action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively. The outputs of the two branches are combined to reweight the gradients of the two sub-tasks. Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames. Extensive studies on various action localization benchmarks (i.e., MultiThumos, Charades, Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show that ASL surpasses the state-of-the-art in terms of average mAP under multiple types of scenarios, e.g., single-labeled, densely-labeled and egocentric.
https://arxiv.org/abs/2305.15701
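Recalibrating training with per-frame action sensitivity amounts to weighting each frame's loss contribution by how informative that frame is. A minimal sketch, assuming plain normalized sensitivity weights (ASL itself combines class-level and instance-level sensitivities and also reweights the gradients of the two sub-tasks):

```python
def reweighted_loss(frame_losses, sensitivities):
    """Weight each frame's loss by its normalized action sensitivity,
    so sensitive frames dominate training and irrelevant frames fade."""
    z = sum(sensitivities)
    return sum(l * s / z for l, s in zip(frame_losses, sensitivities))
```

With uniform sensitivities this reduces to the ordinary mean loss; shifting sensitivity toward low-loss (well-modeled, action-relevant) frames lowers the total, which is the recalibration effect the framework exploits.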
Weakly-supervised temporal action localization aims to identify and localize the action instances in untrimmed videos with only video-level action labels. When humans watch videos, we can adapt our abstract-level knowledge about actions to different video scenarios and detect whether some actions are occurring. In this paper, we mimic this human ability and bring a new perspective for locating and identifying multiple actions in a video. We propose a network named VQK-Net with video-specific query-key attention modeling that learns a unique query for each action category of each input video. The learned queries not only contain the actions' knowledge features at the abstract level but also have the ability to fit this knowledge into the target video scenario, and they are used to detect the presence of the corresponding action along the temporal dimension. To better learn these action category queries, we exploit not only the features of the current input video but also the correlation between different videos, through a novel video-specific action category query learner that works with a query similarity loss. Finally, we conduct extensive experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and ActivityNet1.3) and achieve state-of-the-art performance.
https://arxiv.org/abs/2305.04186
Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods generally suffer from over-complete or incomplete localization. In this paper, we aim to leverage text information to boost WTAL from two aspects, i.e., (a) a discriminative objective to enlarge the inter-class difference, thus reducing over-complete localization; (b) a generative objective to enhance the intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label and regards the text as the query to mine all class-related segments. Without the temporal annotation of actions, TSM compares the text query with the entire videos across the dataset to mine the best matching segments while ignoring irrelevant ones. Due to the shared sub-actions in different categories of videos, merely applying TSM is too strict and neglects the semantic-related segments, which results in incomplete localization. We further introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantic-related segments from videos to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods, and improve their performance with a clear margin. The code is available at this https URL.
https://arxiv.org/abs/2305.00607
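Text-Segment Mining reduces, at its core, to ranking video segments by similarity to a class-text embedding and keeping the best matches. A toy sketch with hand-made vectors (the paper operates in a pretrained vision-language embedding space, which this example only stands in for):

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (assumed non-zero)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def mine_segments(text_emb, segment_embs, top_k=2):
    """Rank segments by cosine similarity to the class-text embedding
    and return the indices of the top-k matches, ignoring the rest."""
    ranked = sorted(range(len(segment_embs)),
                    key=lambda i: cosine(text_emb, segment_embs[i]),
                    reverse=True)
    return ranked[:top_k]
```

As the abstract notes, ranking alone is too strict: segments that are semantically related but not top matches get discarded, which is the gap the generative VLC objective is introduced to close.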