Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to 4K and 8K resolutions.
https://arxiv.org/abs/2601.16210
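The abstract leaves the quantizer's internals unspecified; as a hedged illustration only, a lookup-free binary quantizer in the spirit of a large shared binary codebook (where a D-dimensional latent indexes an implicit codebook of size 2**D, with no stored codebook entries) might look like:

```python
import numpy as np

def binary_quantize(z):
    """Lookup-free binary quantization sketch: each latent dimension is mapped
    to {-1, +1} by its sign, so a D-dim vector indexes an implicit codebook of
    size 2**D without materializing any codebook entries."""
    bits = (z > 0).astype(np.int64)                      # 1 where positive, else 0
    code = np.where(bits == 1, 1.0, -1.0)                # quantized latent in {-1, +1}
    index = int(bits.dot(1 << np.arange(z.shape[-1])))   # integer token id
    return code, index

# Example: a 4-dim latent indexes one of 2**4 = 16 implicit codewords.
code, index = binary_quantize(np.array([0.3, -1.2, 0.8, -0.1]))
```

Because the codebook is implicit, vocabulary size grows exponentially with latent width, which is one common way to reach the "large binary codebook" regime the abstract describes.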
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
https://arxiv.org/abs/2512.16504
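As a sketch of the snippet-discrimination idea (assumed details: snippet embeddings are already extracted, with one augmented positive view per snippet), an InfoNCE objective that pulls each snippet toward its own view and away from the other snippets could be written as:

```python
import numpy as np

def snippet_infonce(anchors, positives, temperature=0.1):
    """InfoNCE over snippet embeddings: each anchor snippet is matched to its
    own augmented view (the diagonal of the similarity matrix) against every
    other snippet in the batch, promoting snippet-discriminative features."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # cross-entropy to the diagonal

# Perfectly aligned, mutually orthogonal snippet embeddings give a near-zero loss.
loss = snippet_infonce(np.eye(4), np.eye(4))
```

Training on many such snippets per sequence is what makes the features temporally sensitive: adjacent snippets become distinguishable rather than collapsed into one video-level code.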
Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.
https://arxiv.org/abs/2512.15386
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
https://arxiv.org/abs/2512.11189
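The interval-classification scheme described above (a background class plus fixed-length non-overlapping intervals) implies a simple decoding step from per-interval labels to action segments. A minimal sketch, with hypothetical label conventions (class 0 as background), might be:

```python
def intervals_to_segments(labels, interval_len, background=0):
    """Merge fixed-length non-overlapping interval predictions into action
    segments: consecutive intervals sharing the same non-background class
    become one (start_sec, end_sec, class) segment."""
    segments = []
    start = None
    for i, lab in enumerate(labels):
        if start is not None and lab != labels[start]:
            segments.append((start * interval_len, i * interval_len, labels[start]))
            start = None
        if start is None and lab != background:
            start = i
    if start is not None:  # flush a run that reaches the end of the video
        segments.append((start * interval_len, len(labels) * interval_len, labels[start]))
    return segments

# Seven 2-second intervals decode into two segments (classes 2 and 1).
segments = intervals_to_segments([0, 2, 2, 0, 1, 1, 1], 2.0)
```

This decoding is deliberately coarse: boundary precision is bounded by the interval length, which is the trade-off such a TSM-based extension accepts for efficiency.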
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at this https URL
https://arxiv.org/abs/2512.01298
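The GFL-inspired head recasts boundary regression as learning a distribution over discretized offsets. A hedged sketch of the decoding side (function name and bin layout are illustrative, not the paper's) shows how the regressed boundary falls out as an expectation:

```python
import numpy as np

def expected_offset(logits, max_offset):
    """GFL-style distributional regression: the head predicts a categorical
    distribution over discretized boundary offsets; the regressed boundary is
    the distribution's expectation, while its spread expresses uncertainty."""
    bins = np.linspace(0.0, max_offset, num=logits.shape[-1])
    probs = np.exp(logits - logits.max())   # softmax over the offset bins
    probs /= probs.sum()
    return float((probs * bins).sum())

# A distribution peaked at the third of five bins spanning [0, 4] regresses to ~2.
logits = np.array([-9.0, -9.0, 9.0, -9.0, -9.0])
boundary = expected_offset(logits, 4.0)
```

A "fuzzy" boundary would produce a flatter distribution, so the same head natively represents ambiguity instead of forcing a single hard offset.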
Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation -- risking the loss of critical information -- or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.
https://arxiv.org/abs/2511.21398
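To make the "programmatic pruning" idea concrete, here is a toy stand-in for the kind of scoring script the abstract says the LLM would generate; the element schema and keyword-overlap heuristic are assumptions for illustration, not Prune4Web's actual interface:

```python
def score_elements(elements, keywords):
    """Programmatic DOM pruning sketch: rank candidate elements by overlap
    between sub-task keywords and each element's text/attributes, so the LLM
    never has to read the raw DOM; it only sees the surviving elements."""
    scored = []
    for el in elements:
        text = " ".join([el.get("text", ""), el.get("aria-label", ""), el.get("id", "")]).lower()
        score = sum(1 for kw in keywords if kw.lower() in text)
        if score > 0:
            scored.append((score, el))
    scored.sort(key=lambda s: -s[0])   # highest-scoring elements first
    return [el for _, el in scored]

# Toy DOM: three candidates pruned down to the one matching "search button".
dom = [
    {"id": "nav-home", "text": "Home"},
    {"id": "search-btn", "text": "Search", "aria-label": "search button"},
    {"id": "footer", "text": "Contact us"},
]
kept = score_elements(dom, ["search", "button"])
```

The design point is that this filter runs as a cheap interpretable program over thousands of elements, which is how the 25x to 50x candidate reduction becomes feasible.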
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multi-object tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
https://arxiv.org/abs/2511.16183
Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address this issue, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
https://arxiv.org/abs/2511.13039
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., "how to interact"). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., "when to interact") is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at this https URL.
https://arxiv.org/abs/2511.12878
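As an illustration of hand-dynamics-guided sampling, a toy stand-in (the selection rule here is an assumption; real hand trajectories would come from a hand tracker) picks the slowest-moving frames as contact/separation candidates, since the hand tends to decelerate around contact:

```python
import numpy as np

def dynamics_guided_samples(hand_positions, k=2):
    """Hand-dynamics-guided sampling sketch: rank frames by hand speed and keep
    the k slowest ones as candidate contact/separation moments, which can then
    be turned into visual prompts for a vision-language model."""
    speed = np.linalg.norm(np.diff(hand_positions, axis=0), axis=1)
    speed = np.concatenate([[speed[0]], speed])   # pad so there is one speed per frame
    return list(np.argsort(speed)[:k])            # indices of the k slowest frames

# Hand approaches, pauses at frames 3-4 (contact), then moves away.
traj = np.array([[0, 0], [2, 0], [4, 0], [5, 0], [5, 0], [7, 0]], dtype=float)
picks = dynamics_guided_samples(traj, k=2)
```

Sampling this way keeps the prompt set small and biased toward exactly the moments TIL cares about, instead of uniformly covering the clip.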
Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
https://arxiv.org/abs/2511.03943
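The abstract does not give BDR's exact target definition; one plausible signed-distance target, sketched under that assumption, regresses each frame to its signed offset from the nearest annotated boundary, so the prediction crosses zero exactly at the boundary instead of forming a soft classification peak:

```python
import numpy as np

def signed_boundary_distance(num_frames, boundaries):
    """Signed-distance regression target sketch: each frame's target is its
    offset to the nearest boundary, negative before it and positive after,
    giving a sharp zero-crossing at the boundary itself."""
    t = np.arange(num_frames, dtype=float)
    d = np.stack([t - b for b in boundaries])   # signed offsets to every boundary
    nearest = np.argmin(np.abs(d), axis=0)      # index of the closest boundary per frame
    return d[nearest, np.arange(num_frames)]

# One boundary at frame 3 in a 6-frame clip: targets cross zero at frame 3.
targets = signed_boundary_distance(6, [3.0])
```

A head trained against such targets can be read out by finding zero-crossings, which is consistent with the "retrofits in ~50 lines" claim: only the regression target and decoding change.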
Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at this https URL.
https://arxiv.org/abs/2510.16444
Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.
https://arxiv.org/abs/2510.15963
The ability to determine when a person struggles during skill acquisition is crucial for both optimizing human learning and enabling the development of effective assistive systems. As skills develop, the type and frequency of struggles tend to change, and understanding this evolution is key to determining the user's current stage of learning. However, existing manipulation datasets have not focused on how struggle evolves over time. In this work, we collect a dataset for struggle determination, featuring 61.68 hours of video recordings, 2,793 videos, and 5,385 annotated temporal struggle segments collected from 76 participants. The dataset includes 18 tasks grouped into four diverse activities -- tying knots, origami, tangram puzzles, and shuffling cards, representing different task variations. In addition, participants repeated the same task five times to capture their evolution of skill. We define the struggle determination problem as a temporal action localization task, focusing on identifying and precisely localizing struggle segments with start and end times. Experimental results show that Temporal Action Localization models can successfully learn to detect struggle cues, even when evaluated on unseen tasks or activities. The models attain an overall average mAP of 34.56% when generalizing across tasks and 19.24% across activities, indicating that struggle is a transferable concept across various skill-based tasks while still posing challenges for further improvement in struggle detection. Our dataset is available at this https URL.
https://arxiv.org/abs/2510.01362
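The segment-level mAP figures quoted above rest on temporal IoU matching between predicted and ground-truth struggle segments; the core criterion is simply:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds: the overlap
    criterion used to match predictions to ground truth when computing
    segment-level mAP in temporal action localization."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping half of a 2 s ground-truth segment scores IoU 1/3.
iou = temporal_iou((0.0, 2.0), (1.0, 3.0))
```

A prediction counts as a true positive only when its temporal IoU with an unmatched ground-truth segment exceeds the evaluation threshold, and mAP averages precision over those thresholds and classes.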
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.
https://arxiv.org/abs/2509.12145
Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions over short durations. Existing supervised and weakly supervised solutions often rely on extensive annotated datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios. In this work, we introduce a lightweight and unsupervised skeleton-based action localization pipeline that leverages spatio-temporal graph neural representations. Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising task with blockwise partitions, enabling it to learn intrinsic motion dynamics without any manual labeling. At inference, we define a novel Action Dynamics Metric (ADM), computed directly from low-dimensional ASTGCN embeddings, which detects motion boundaries by identifying inflection points in its curvature profile. Our method achieves a mean Average Precision (mAP) of 82.66% and average localization latency of 29.09 ms on the DSV Diving dataset, matching state-of-the-art supervised performance while maintaining computational efficiency. Furthermore, it generalizes robustly to unseen, in-the-wild diving footage without retraining, demonstrating its practical applicability for lightweight, real-time action analysis systems in embedded or dynamic environments.
https://arxiv.org/abs/2508.19647
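The ADM itself is not specified in the abstract; as a hedged sketch of the boundary-detection step, locating inflection points of a 1-D dynamics curve via sign changes of its discrete second difference might look like:

```python
import numpy as np

def inflection_points(curve):
    """Inflection-point detection sketch: candidate motion boundaries are
    frames where the discrete second difference of a 1-D dynamics signal
    changes sign (curvature flips from convex to concave or back)."""
    second = np.diff(curve, n=2)        # discrete second difference
    signs = np.sign(second)
    return [i + 1 for i in range(len(signs) - 1)
            if signs[i] != 0 and signs[i + 1] != 0 and signs[i] != signs[i + 1]]

# A cubic bends from convex to concave once, near the middle of the window.
t = np.linspace(-2, 2, 8)
curve = t ** 3                          # second derivative 6t changes sign at t = 0
points = inflection_points(curve)
```

In practice the dynamics signal would be computed from the low-dimensional ASTGCN embeddings and smoothed before differencing, since second differences amplify noise.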
Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., ``how to interact''). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., ``when to interact'') is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at this https URL.
https://arxiv.org/abs/2508.12349
Semi-Supervised Learning (SSL) has shown tremendous potential to improve the predictive performance of deep learning models when annotations are hard to obtain. However, the application of SSL has so far been mainly studied in the context of image classification. In this work, we present a semi-supervised approach for spatial-temporal action localization. We introduce a dual guidance network to select better pseudo-bounding boxes. It combines a frame-level classification with a bounding-box prediction to enforce action class consistency across frames and boxes. Our evaluation across well-known spatial-temporal action localization datasets, namely UCF101-24, J-HMDB-21, and AVA, shows that the proposed module considerably enhances the model's performance in limited labeled data settings. Our framework achieves superior results compared to extended image-based semi-supervised baselines.
https://arxiv.org/abs/2507.21247
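A minimal sketch of the dual-guidance selection rule (data layout, names, and threshold are assumptions for illustration): a pseudo-bounding box survives only when the frame-level classifier confidently agrees with the box-level class:

```python
def consistent_pseudo_boxes(boxes, frame_probs, threshold=0.5):
    """Dual-guidance filtering sketch: keep a pseudo-bounding box only if the
    frame-level classifier assigns its action class a probability above the
    threshold, enforcing class consistency across frames and boxes."""
    kept = []
    for frame_idx, box, cls, score in boxes:
        if frame_probs[frame_idx].get(cls, 0.0) >= threshold:
            kept.append((frame_idx, box, cls, score))
    return kept

# Frame 0 confidently shows "run"; frame 1 does not, so only the first box is kept.
boxes = [(0, (10, 20, 30, 40), "run", 0.8), (1, (0, 0, 5, 5), "run", 0.7)]
frame_probs = {0: {"run": 0.9}, 1: {"run": 0.2}}
kept = consistent_pseudo_boxes(boxes, frame_probs)
```

Filtering pseudo labels by cross-head agreement is the standard way such dual-guidance schemes trade pseudo-label quantity for quality in the low-label regime.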
Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.
https://arxiv.org/abs/2507.06411
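As a rough illustration of the three-subtask decomposition (proposal, then classification, then localization), the sketch below stubs each stage with a simple NumPy function operating on per-frame scores. The real PCL-Former stages are transformer modules trained with dedicated losses, so every name and rule here is an illustrative assumption:

```python
import numpy as np

def proposal_stage(actionness, thresh=0.5):
    """Turn per-frame actionness scores into candidate segments:
    contiguous runs above the threshold become (start, end) pairs."""
    mask = actionness > thresh
    segments, start = [], None
    for t, m in enumerate(mask):
        if m and start is None:
            start = t
        elif not m and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:                      # segment runs to the end
        segments.append((start, len(mask)))
    return segments

def classification_stage(class_scores, segments):
    """Assign each candidate segment the class with the highest
    average per-frame score inside the segment."""
    return [int(np.argmax(class_scores[s:e].mean(axis=0))) for s, e in segments]

def localization_stage(segments, offsets):
    """Refine segment boundaries with regressed start/end offsets."""
    return [(s + ds, e + de) for (s, e), (ds, de) in zip(segments, offsets)]
```

Chaining the three functions mirrors the pipeline in the abstract: candidates come first, then their labels, then boundary refinement.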
Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets, UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100, surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, and +1.7% (verb) / +1.4% (noun), respectively.
真实世界的视频通常包含重叠事件和复杂的时序依赖关系,这使得多模态交互建模特别具有挑战性。我们提出了用于密集语义动作定位的 DEL 框架,旨在在细粒度时间分辨率下准确检测和分类长未裁剪视频中的多个动作。DEL 包括两个关键模块:利用掩码自注意力增强模态内一致性的音频与视觉特征对齐模块,以及在多个尺度上建模跨模态依赖关系的多模态交互细化模块,从而同时捕获高级语义和细粒度细节。我们的方法在多个真实世界的时序动作定位(TAL)数据集(UnAV-100、THUMOS14、ActivityNet 1.3 和 EPIC-Kitchens-100)上取得了最先进的性能,分别以 +3.3%、+2.6%、+1.2%,以及 +1.7%(动词)/+1.4%(名词)的平均 mAP 提升超越了此前的方法。
https://arxiv.org/abs/2506.23196
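The intra-modal alignment step relies on masked self-attention. A minimal single-head sketch over one modality's snippet features, in plain NumPy with no learned query/key/value projections (a simplifying assumption for brevity, not DEL's actual implementation), looks like:

```python
import numpy as np

def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over time with a boolean mask.

    x:    (T, D) per-snippet features of one modality
    mask: (T, T) boolean; True = may attend, False = blocked
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # pairwise similarities
    scores = np.where(mask, scores, -1e9)         # suppress masked positions
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # mask-weighted mixture
```

With a diagonal-only mask each snippet attends solely to itself and the input passes through unchanged; relaxing the mask lets features mix within the modality, which is the consistency-enhancing behavior the abstract attributes to this module.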
Weakly supervised temporal action localization is a challenging task, as only video-level annotations are available during training. To address this problem, we propose a two-stage approach that fully exploits multi-resolution information in the temporal domain and generates high-quality frame-level pseudo labels based on both appearance and motion streams. Specifically, in the first stage, we generate reliable initial frame-level pseudo labels; in the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. To obtain reliable initial frame-level pseudo labels, in the first stage we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high-quality class activation sequences (CASs); each CAS comprises a number of sequences, each of which measures how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In our PTLR framework, two networks called Network-OTS and Network-RTS, which generate CASs for the original temporal scale and the reduced temporal scales respectively, serve as two streams (i.e., the OTS stream and the RTS stream) that refine the pseudo labels in turn. In this way, multi-resolution information in the temporal domain is exchanged at the pseudo-label level, and each stream (i.e., the OTS/RTS stream) is improved by exploiting the refined pseudo labels from the other stream (i.e., the RTS/OTS stream).
弱监督时间动作定位是一项具有挑战性的任务,因为在训练过程中只有视频级别的注释可用。为了解决这个问题,我们提出了一种两阶段的方法来充分利用时域中的多分辨率信息,并基于外观和运动流生成高质量的帧级伪标签。具体来说,在第一阶段中,我们生成可靠的初始帧级伪标签;在第二阶段中,我们迭代地细化这些伪标签,并使用一组具有高置信度伪标签的选定帧训练神经网络,以更好地预测每一帧的动作类别得分。我们充分利用多尺度时域信息来提高时间动作定位性能。为了获得可靠的初始帧级伪标签,我们在第一阶段提出了初始标签生成(ILG)模块,该模块利用时间多分辨率一致性来生成高质量的类激活序列(CAS);每个CAS由多个序列组成,每个序列衡量视频中每一帧属于某个特定动作类别的可能性。在第二阶段,我们提出了渐进式时间标签细化(PTLR)框架:名为 Network-OTS 和 Network-RTS 的两个网络分别在原始时间尺度和降低的时间尺度上生成 CAS,并作为两条流(即 OTS 流和 RTS 流)交替细化伪标签。通过这种方式,时域中的多分辨率信息在伪标签层面进行交换,每一条流(即 OTS/RTS 流)都可以利用另一条流(即 RTS/OTS 流)中细化后的伪标签得到改进。
https://arxiv.org/abs/2506.18261