Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames, but the reliable detection, localization, and reporting of a coherent anomalous event: a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction and, as a result, systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT[19], CHAD[6], NWPUC[4], and HuVAD[25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on NWPUC[4], their event-level localization precision falls below 10% even at a minimal tIoU of 0.2, with an average event-level F1 of only 0.11 across all thresholds. The code base for this work is available at this https URL.
https://arxiv.org/abs/2604.09327
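A minimal sketch of the event-level protocol described above, assuming events are given as (start, end) frame intervals and using greedy one-to-one matching by descending tIoU; the paper's exact matching rule may differ:

```python
# Minimal sketch of tIoU-based event matching and multi-threshold F1.
# Events are (start, end) pairs in frames; greedy one-to-one matching by
# descending tIoU is an illustrative choice, not necessarily the paper's rule.

def tiou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def event_f1(pred_events, gt_events, threshold):
    """Precision/recall/F1 after greedy matching at a single tIoU threshold."""
    pairs = sorted(
        ((tiou(p, g), i, j) for i, p in enumerate(pred_events)
                             for j, g in enumerate(gt_events)),
        reverse=True)
    matched_pred, matched_gt, tp = set(), set(), 0
    for score, i, j in pairs:
        if score < threshold:
            break
        if i in matched_pred or j in matched_gt:
            continue
        matched_pred.add(i); matched_gt.add(j); tp += 1
    prec = tp / len(pred_events) if pred_events else 0.0
    rec = tp / len(gt_events) if gt_events else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

if __name__ == "__main__":
    preds = [(10, 40), (100, 130)]
    gts = [(12, 45), (200, 240)]
    for t in (0.2, 0.3, 0.4, 0.5):
        print(t, event_f1(preds, gts, t))
```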
Learning methods that use synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the discriminative robustness of models. However, many existing methods improve robustness only indirectly, through the diversification of training samples, and do not explicitly teach the model which regions of the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during training-data synthesis, indicating whether each region of the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model's reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.
https://arxiv.org/abs/2604.02946
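A rough PyTorch sketch of the input-gradient guidance idea, assuming a binary provenance mask that is 1 on synthesized target regions and 0 elsewhere; the penalty weight and the exact gradient decomposition are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of input-gradient guidance with a provenance mask (PyTorch).
# `mask` is assumed to be 1 on target regions from the synthesis process, 0 elsewhere.
import torch
import torch.nn.functional as F

def guided_loss(model, x, y, mask, lam=1.0):
    x = x.clone().requires_grad_(True)
    task_loss = F.cross_entropy(model(x), y)
    # Input gradients of the task loss, kept in the graph so the penalty is trainable.
    grads = torch.autograd.grad(task_loss, x, create_graph=True)[0]
    # Suppress gradient energy over non-target (mask == 0) regions.
    penalty = ((grads * (1.0 - mask)) ** 2).mean()
    return task_loss + lam * penalty

# Toy usage with a hypothetical classifier and a random provenance mask.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
mask = (torch.rand(4, 1, 32, 32) > 0.5).float()
guided_loss(model, x, y, mask).backward()
```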
Animal models, particularly rats, play a critical role in seizure research for studying epileptogenesis and treatment response. However, progress is limited by the lack of datasets with precise temporal annotations and standardized evaluation protocols. Existing animal behavior datasets often have limited accessibility, coarse labeling, and insufficient temporal localization of clinically meaningful events. To address these limitations, we introduce RatSeizure, the first publicly available benchmark for fine-grained seizure behavior analysis. The dataset consists of recorded video clips annotated with seizure-related action units and temporal boundaries, enabling both behavior classification and temporal localization. We further propose RaSeformer, a saliency-context Transformer for temporal action localization that highlights behavior-relevant context while suppressing redundant cues. Experiments on RatSeizure show that RaSeformer achieves strong performance and provides a competitive reference model for this challenging task. We also establish standardized dataset splits and evaluation protocols to support reproducible benchmarking.
https://arxiv.org/abs/2603.26780
The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers higher-quality representations with 88.09% Top-1 test accuracy, while the ViT-Base variant proves to be a practical alternative, achieving 82.55% accuracy at a significantly lower fine-tuning cost (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-Base configuration maintains robust results.
https://arxiv.org/abs/2603.21048
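As a rough illustration of applying SPPF to temporal features, the sketch below adapts the familiar SPPF block (successive max pooling plus concatenation) to 1-D sequences; layer widths and kernel size are assumptions rather than the paper's configuration:

```python
# Sketch of an SPPF-style block adapted to 1-D temporal features (PyTorch).
import torch
import torch.nn as nn

class TemporalSPPF(nn.Module):
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        hidden = channels // 2
        self.reduce = nn.Conv1d(channels, hidden, 1)
        self.pool = nn.MaxPool1d(kernel_size, stride=1, padding=kernel_size // 2)
        self.fuse = nn.Conv1d(hidden * 4, channels, 1)

    def forward(self, x):            # x: (batch, channels, time)
        x = self.reduce(x)
        p1 = self.pool(x)            # successive pooling approximates growing receptive fields
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))

feats = torch.randn(2, 256, 64)      # e.g., clip features over 64 time steps
out = TemporalSPPF(256)(feats)       # same temporal length, multi-scale context fused
```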
Point-supervised Temporal Action Localization (PTAL) adopts a lightweight frame-annotation paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head with only point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding these temporal relationships is crucial: it helps a model understand how an action is defined and therefore benefits localizing the full extent of an action. To this end, in this paper we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
https://arxiv.org/abs/2602.05718
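One plausible instantiation of the Action Order Understanding task, sketched in PyTorch under the assumption that it is cast as predicting whether two snippet features appear in their original temporal order; the paper's actual formulation of the three tasks may differ:

```python
# Hypothetical sketch of an order-understanding pretext head (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cls = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, a, b):                 # a, b: (batch, dim) snippet features
        return self.cls(torch.cat([a, b], dim=-1))

def order_loss(head, early, late):
    """`early`/`late` are features of snippets that occur earlier/later in time."""
    pos = head(early, late)                  # correct order -> label 1
    neg = head(late, early)                  # swapped order -> label 0
    logits = torch.cat([pos, neg], dim=0)
    labels = torch.cat([torch.ones(len(pos)),
                        torch.zeros(len(neg))]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```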
IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip-classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL in this setting. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., those with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template (datasets, protocols, and analyses) to accelerate community-wide progress toward scalable WS-IMU-TAL.
https://arxiv.org/abs/2602.01850
Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling cost and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes semantically rich textual features from visual descriptions to complement the visual features. This is achieved by adding two new modules to the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple pre-trained models. PMA then projects all features into a unified semantic space and applies point-level multimodal contrastive learning to reduce the gap between the visual and linguistic modalities. Finally, the enhanced multimodal features are fed into the action detector for precise localization. Extensive experimental results on five widely used benchmarks demonstrate the favorable performance of our proposed framework compared to several state-of-the-art methods. Moreover, our computational overhead analysis shows that the framework can run on a single 24 GB RTX 3090 GPU, indicating its practicality and scalability.
https://arxiv.org/abs/2602.01257
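A minimal sketch of point-level cross-modal contrastive alignment of the kind PMA might apply, assuming visual and textual features of the point-annotated frames have already been projected into a shared space; the temperature and symmetric InfoNCE form are illustrative choices:

```python
# Sketch of a point-level vision-text contrastive (InfoNCE) loss (PyTorch).
import torch
import torch.nn.functional as F

def point_contrastive_loss(visual, text, temperature=0.07):
    """visual, text: (num_points, dim) features of the same annotated frames;
    matching rows are positives, all other pairs are negatives."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = v @ t.T / temperature            # (num_points, num_points) similarity
    targets = torch.arange(len(v), device=v.device)
    # Symmetric InfoNCE: align vision->text and text->vision.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```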
Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.
https://arxiv.org/abs/2601.21078
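The residual aggregation idea can be sketched as follows, with a simple confidence-gap proxy standing in for the paper's learned language-advantage estimate (an assumption for illustration only):

```python
# Sketch of gated residual vision-language aggregation (PyTorch): vision logits
# remain the primary signal; language logits are added as a gated residual.
import torch

def aggregate(vision_logits, language_logits):
    # Advantage proxy: how much more confident the language branch is than vision.
    adv = (language_logits.softmax(-1).amax(-1) -
           vision_logits.softmax(-1).amax(-1))             # (batch,)
    gate = torch.sigmoid(adv).unsqueeze(-1)                # per-snippet weight in (0, 1)
    return vision_logits + gate * language_logits          # language as residual refinement

fused = aggregate(torch.randn(8, 20), torch.randn(8, 20))
```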
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolutions.
https://arxiv.org/abs/2601.16210
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
https://arxiv.org/abs/2512.16504
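A small sketch of the snippet discrimination pretext, under the assumption that two augmented views of each snippet are contrasted against all other snippets in the batch; snippet length and temperature are illustrative:

```python
# Sketch of snippet segmentation plus an InfoNCE snippet-discrimination loss (PyTorch).
import torch
import torch.nn.functional as F

def make_snippets(seq, snippet_len=16):
    """seq: (time, feat) -> (num_snippets, snippet_len, feat), dropping the remainder."""
    n = seq.shape[0] // snippet_len
    return seq[: n * snippet_len].reshape(n, snippet_len, -1)

def snippet_infonce(anchor_feats, positive_feats, temperature=0.1):
    """anchor/positive: (num_snippets, dim) embeddings of two augmented views of
    the same snippets; every other snippet in the batch serves as a negative."""
    a = F.normalize(anchor_feats, dim=-1)
    p = F.normalize(positive_feats, dim=-1)
    logits = a @ p.T / temperature
    targets = torch.arange(len(a), device=a.device)
    return F.cross_entropy(logits, targets)
```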
Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.
https://arxiv.org/abs/2512.15386
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
https://arxiv.org/abs/2512.11189
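A minimal sketch of the post-processing implied by classifying fixed-length non-overlapping intervals with a background class: consecutive intervals of the same action class are merged into segments (the interval length here is an assumed value, not the challenge configuration):

```python
# Sketch: merge per-interval class predictions into (start_sec, end_sec, class) segments.
def intervals_to_segments(labels, interval_sec=1.0, background=0):
    """labels: per-interval class ids, e.g. output of an interval classifier."""
    segments = []
    start, current = None, background
    for i, lab in enumerate(list(labels) + [background]):   # sentinel flushes the last run
        if lab != current:
            if current != background:
                segments.append((start * interval_sec, i * interval_sec, current))
            start, current = i, lab
    return segments

print(intervals_to_segments([0, 0, 3, 3, 3, 0, 5, 5]))
# [(2.0, 5.0, 3), (6.0, 8.0, 5)]
```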
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at this https URL
https://arxiv.org/abs/2512.01298
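A short sketch of a GFL-style boundary distribution head of the kind described above: each boundary offset is predicted as a discrete distribution over bins and decoded as its expectation. The bin count and convolutional parameterization are assumptions, not TBT-Former's exact design:

```python
# Sketch of a distribution-based boundary regression head (PyTorch).
import torch
import torch.nn as nn

class BoundaryDistributionHead(nn.Module):
    def __init__(self, dim, num_bins=16):
        super().__init__()
        self.num_bins = num_bins
        # Two distributions per time step: one for the start offset, one for the end offset.
        self.proj = nn.Conv1d(dim, 2 * num_bins, kernel_size=3, padding=1)

    def forward(self, feats):                      # feats: (batch, dim, time)
        logits = self.proj(feats)                  # (batch, 2 * bins, time)
        b, _, t = logits.shape
        probs = logits.view(b, 2, self.num_bins, t).softmax(dim=2)
        bins = torch.arange(self.num_bins, device=feats.device, dtype=feats.dtype)
        # Expected offset in bin units; uncertainty is carried by the distribution itself.
        offsets = (probs * bins.view(1, 1, -1, 1)).sum(dim=2)   # (batch, 2, time)
        return offsets, probs

offsets, probs = BoundaryDistributionHead(256)(torch.randn(2, 256, 96))
```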
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
https://arxiv.org/abs/2511.21398
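A hypothetical example of the kind of lightweight scoring script the Programmatic Filter could execute; the element schema and heuristics are invented for illustration and are not taken from Prune4Web:

```python
# Hypothetical DOM-pruning script: score elements against sub-task keywords and
# keep only the top candidates, instead of feeding the raw DOM to an LLM.
def score_element(element, keywords):
    text = " ".join([element.get("tag", ""), element.get("text", ""),
                     element.get("aria_label", "")]).lower()
    score = sum(2.0 for kw in keywords if kw in text)
    if element.get("tag") in ("button", "a", "input"):
        score += 1.0                       # prefer interactive elements
    return score

def prune_dom(elements, keywords, top_k=20):
    ranked = sorted(elements, key=lambda e: score_element(e, keywords), reverse=True)
    return ranked[:top_k]

dom = [{"tag": "button", "text": "Add to cart"},
       {"tag": "div", "text": "Footer links"},
       {"tag": "input", "aria_label": "search products"}]
print(prune_dom(dom, ["add to cart"], top_k=2))
```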
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multi-object tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce the Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
https://arxiv.org/abs/2511.16183
Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address this issue, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
https://arxiv.org/abs/2511.13039
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., "how to interact"). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., "when to interact") is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at this https URL.
https://arxiv.org/abs/2511.12878
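An illustrative sketch of hand-dynamics-guided sampling, assuming per-frame hand-center coordinates are available from an off-the-shelf hand detector; the smoothing window and minimum-speed selection rule are assumptions, not EgoLoc's exact procedure:

```python
# Sketch: pick candidate contact/separation frames where smoothed hand speed is lowest.
import numpy as np

def candidate_frames(hand_centers, window=5, num_candidates=8):
    """hand_centers: (num_frames, 2) hand-center coordinates per frame."""
    speed = np.linalg.norm(np.diff(hand_centers, axis=0), axis=1)
    # Simple moving-average smoothing of the speed signal.
    kernel = np.ones(window) / window
    smooth = np.convolve(speed, kernel, mode="same")
    # Frames where the hand slows down the most are likely near contact/separation,
    # and can be turned into visual prompts for a vision-language model.
    return np.argsort(smooth)[:num_candidates]

frames = candidate_frames(np.cumsum(np.random.randn(200, 2), axis=0))
```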
Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection τ ∈ [0, 1], enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing a 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
https://arxiv.org/abs/2511.03943
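A minimal sketch of how per-frame signed-distance regression targets for BDR could be constructed; the clipping and normalization choices are assumptions rather than the paper's exact recipe:

```python
# Sketch: each frame regresses its signed distance (in frames) to the nearest
# annotated boundary, negative before the boundary and positive after.
import numpy as np

def signed_distance_targets(num_frames, boundaries, clip=32):
    """boundaries: sorted frame indices of action starts/ends."""
    frames = np.arange(num_frames)[:, None]                 # (num_frames, 1)
    diffs = frames - np.asarray(boundaries)[None, :]        # signed distance to every boundary
    nearest = np.abs(diffs).argmin(axis=1)
    signed = diffs[np.arange(num_frames), nearest]
    return np.clip(signed, -clip, clip) / clip              # normalized to [-1, 1]

targets = signed_distance_targets(100, boundaries=[20, 55, 80])
```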
Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises more than 2.9 million frames and more than 75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both the partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at this https URL.
https://arxiv.org/abs/2510.16444
Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.
https://arxiv.org/abs/2510.15963