We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
https://arxiv.org/abs/2512.13684
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
https://arxiv.org/abs/2512.13478
This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm suits environments with continuous state and action spaces, such as the TRAS, as it does not require a model of the system. Simulation results illustrate the effectiveness of the RL control method. Next, external wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
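TD3's key ingredients are concrete enough to sketch: twin critics with a clipped-minimum target to curb overestimation, and target-policy smoothing via clipped noise. A minimal numpy illustration (hyperparameter values are assumed; this is not the paper's implementation):

```python
import numpy as np

def td3_target(r, gamma, q1_next, q2_next):
    """Clipped double-Q target: taking the minimum of the two target
    critics curbs the overestimation bias of a single critic."""
    return r + gamma * np.minimum(q1_next, q2_next)

def smoothed_target_action(mu_next, sigma=0.2, clip=0.5, lo=-1.0, hi=1.0):
    """Target-policy smoothing: clipped Gaussian noise on the target
    action regularizes the critic (sigma/clip values are illustrative)."""
    noise = np.clip(np.random.normal(0.0, sigma, size=np.shape(mu_next)), -clip, clip)
    return np.clip(mu_next + noise, lo, hi)

y = td3_target(r=1.0, gamma=0.99, q1_next=10.0, q2_next=12.0)
# y = 1.0 + 0.99 * 10.0 = 10.9
```

The delayed actor updates that give TD3 its name would sit around these two pieces in the full training loop.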
https://arxiv.org/abs/2512.13356
Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
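The dispersion idea can be illustrated with a toy objective: minimizing the mean pairwise cosine similarity of a forget identity's embeddings spreads them over the hypersphere so no compact cluster remains for retrieval. A sketch of such a loss (illustrative, not the paper's actual objective):

```python
import numpy as np

def dispersion_loss(embeddings):
    """Mean pairwise cosine similarity of the forget-set embeddings.
    Minimizing it pushes the identity's embeddings apart on the
    hypersphere, preventing a compact, retrievable cluster."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    n = len(z)
    off_diag = sim[~np.eye(n, dtype=bool)]
    return off_diag.mean()

tight = np.array([[1.0, 0.01], [1.0, -0.01], [0.99, 0.0]])   # compact cluster
spread = np.array([[1.0, 0.0], [-0.5, 0.8], [-0.5, -0.8]])   # dispersed
assert dispersion_loss(tight) > dispersion_loss(spread)
```

In practice the loss would be applied only to the forget identities, alongside a utility term that preserves the remaining clusters.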
https://arxiv.org/abs/2512.13317
High resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods, particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. Our work sets a new standard for leaf-level tracking under realistic conditions, and CanolaTrack - the largest dataset for leaf tracking in agricultural crops - will contribute to future research in plant phenotyping. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2512.13130
Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency - especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and code are available via the open repository: this https URL
https://arxiv.org/abs/2512.13122
Object tracking is an important step in robotics and autonomous driving pipelines, and must generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at this https URL.
https://arxiv.org/abs/2512.13007
This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
https://arxiv.org/abs/2512.12987
LLM-based agents often operate in a greedy, step-by-step manner, selecting actions solely based on the current observation without considering long-term consequences or alternative paths. This lack of foresight is particularly problematic in web environments, which are only partially observable - limited to browser-visible content (e.g., DOM and UI elements) - where a single misstep often requires complex and brittle navigation to undo. Without an explicit backtracking mechanism, agents struggle to correct errors or systematically explore alternative paths. Tree-search methods provide a principled framework for such structured exploration, but existing approaches lack mechanisms for safe backtracking, making them prone to unintended side effects. They also assume that all actions are reversible, ignoring the presence of irreversible actions - limitations that reduce their effectiveness in realistic web tasks. To address these challenges, we introduce WebOperator, a tree-search framework that enables reliable backtracking and strategic exploration. Our method incorporates a best-first search strategy that ranks actions by both reward estimates and safety considerations, along with a robust backtracking mechanism that verifies the feasibility of previously visited paths before replaying them, preventing unintended side effects. To further guide exploration, WebOperator generates action candidates from multiple, varied reasoning contexts to ensure diverse and robust exploration, and subsequently curates a high-quality action set by filtering out invalid actions pre-execution and merging semantically equivalent ones. Experimental results on WebArena and WebVoyager demonstrate the effectiveness of WebOperator. On WebArena, WebOperator achieves a state-of-the-art 54.6% success rate with gpt-4o, underscoring the critical advantage of integrating strategic foresight with safe execution.
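The best-first ranking of candidate actions by reward and safety can be sketched with a priority queue; the multiplicative scoring rule and the action names below are hypothetical, not WebOperator's actual code:

```python
import heapq

def best_first(actions, reward_est, safety, top_k=1):
    """Rank candidate actions by reward estimate weighted by a safety
    score; risky or irreversible actions score low on safety, so they
    are explored last (illustrative scoring, not the paper's)."""
    heap = []
    for a in actions:
        score = reward_est[a] * safety[a]
        heapq.heappush(heap, (-score, a))  # max-heap via negation
    return [heapq.heappop(heap)[1] for _ in range(min(top_k, len(heap)))]

reward = {"click_buy": 0.9, "open_cart": 0.6, "delete_account": 0.95}
safety = {"click_buy": 0.8, "open_cart": 1.0, "delete_account": 0.1}
best_first(list(reward), reward, safety, top_k=2)
# high-reward but unsafe "delete_account" ranks last
```

A real agent would refresh these scores after every environment step and only replay a path after the feasibility check described above.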
https://arxiv.org/abs/2512.12692
This study introduces a lightweight perimeter tracking method designed for micro UAV teams operating over wildfire environments under limited bandwidth conditions. Thermal image frames generate coarse hot-region masks through adaptive thresholding and morphological refinement, while RGB frames contribute edge cues and suppress texture-related false detections using gradient-based filtering. A rule-level merging strategy selects boundary candidates and simplifies them via the Ramer-Douglas-Peucker algorithm. The system incorporates periodic beacons and an inertial feedback loop that maintains trajectory stability in the presence of GPS degradation. The guidance loop targets sub-50 ms latency on embedded System on Chip (SoC) platforms by constraining per-frame pixel operations and precomputing gradient tables. Small-scale simulations demonstrate reductions in average path length and boundary jitter compared to a pure edge-tracking baseline, while maintaining environmental coverage measured through intersection merge analysis. Battery consumption and computational utilization confirm the feasibility of achieving 10-15 m/s forward motion on standard micro platforms. This approach enables rapid deployment in the field, requiring only robust sensing and minimal communications for emergency reconnaissance applications.
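The Ramer-Douglas-Peucker boundary simplification step is standard and self-contained enough to sketch in full: keep the point farthest from the chord if it exceeds a tolerance, then recurse on both halves.

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker polyline simplification."""
    def perp_dist(p, a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        norm = math.hypot(dx, dy)
        if norm == 0:
            return math.hypot(px - ax, py - ay)
        return abs(dy * px - dx * py + bx * ay - by * ax) / norm

    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]       # chord is close enough
    left = rdp(points[:idx + 1], epsilon)    # recurse on both halves
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right

# A near-straight boundary segment collapses to its endpoints:
assert rdp([(0, 0), (1, 0.01), (2, 0)], 0.1) == [(0, 0), (2, 0)]
```

For the perimeter use case, epsilon trades boundary fidelity against the number of vertices that must be transmitted over the limited link.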
https://arxiv.org/abs/2512.12199
Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no single strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.
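The heavy-hitter idea behind H2O can be sketched as a cache-eviction rule: score each cached token by the cumulative attention it has received, always retain a small recency window, and fill the rest of the budget with the top scorers. The function below is an illustrative reconstruction, not the benchmarked implementation:

```python
import numpy as np

def h2o_keep_mask(attn, budget, recent=2):
    """H2O-style eviction sketch. `attn` is a (queries x cached tokens)
    attention matrix; tokens with the highest cumulative attention are
    the "heavy hitters". Returns sorted indices of tokens to keep."""
    n = attn.shape[1]
    scores = attn.sum(axis=0)            # cumulative attention per token
    keep = set(range(n - recent, n))     # always keep the recency window
    for idx in np.argsort(-scores):      # then the heaviest hitters
        if len(keep) >= budget:
            break
        keep.add(int(idx))
    return sorted(keep)
```

Evicted tokens' KV entries are dropped, so cache size stays at the budget regardless of how long the reasoning trace grows.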
https://arxiv.org/abs/2512.12008
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at this https URL.
https://arxiv.org/abs/2512.11792
We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
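Plücker ray maps encode each pixel's viewing ray as a direction plus a moment (o x d), which is what makes them a convenient per-pixel camera conditioning signal. A minimal sketch under standard pinhole and world-to-camera conventions (the paper's exact conventions may differ):

```python
import numpy as np

def plucker_ray_map(K, R, t, h, w):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.
    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation
    (x_cam = R @ x_world + t), so the camera center is o = -R^T t."""
    o = -R.T @ t
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, float)], -1)
    dirs = pix @ np.linalg.inv(K).T @ R      # ray directions in world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(o, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)   # (h, w, 6)
```

The six channels per pixel are translation-aware (unlike a direction-only map), which is why Plücker maps can disambiguate camera trajectories.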
https://arxiv.org/abs/2512.11645
The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve a comparable performance as the state-of-the-art using only 1.5% of the training data, and surpasses the state-of-the-art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
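The depth-scale ambiguity argument rests on the standard stereo relation Z = f * B / d: with a known baseline, disparity yields metric depth, which a single camera cannot provide. A one-line illustration with hypothetical values:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from rectified stereo: Z = f * B / d. The known
    baseline fixes the scale that a monocular view leaves ambiguous."""
    return focal_px * baseline_m / disparity_px

# 700 px focal length, 12 cm baseline, 20 px disparity -> 4.2 m
assert abs(disparity_to_depth(20.0, 700.0, 0.12) - 4.2) < 1e-9
```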
https://arxiv.org/abs/2512.10956
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language descriptions of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at this https URL
https://arxiv.org/abs/2512.10945
Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but run slowly, or run fast yet are temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
https://arxiv.org/abs/2512.10939
We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
https://arxiv.org/abs/2512.10935
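The egocentric/allocentric factorization can be made concrete in a few lines: depthmaps and camera intrinsics lift pixels into local camera coordinates, while extrinsics place those points in world coordinates and scene flow moves them there. A minimal numpy sketch under assumed toy values; function names and shapes are illustrative, not Any4D's API.

```python
import numpy as np

def unproject(depth, K):
    """Lift a depthmap into 3D points in local camera coordinates (egocentric)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    rays = np.linalg.inv(K) @ pix                                      # 3 x HW
    return (rays * depth.reshape(1, -1)).T                             # HW x 3

def to_world(pts_cam, R, t):
    """Map local-camera points into global world coordinates (allocentric)."""
    return pts_cam @ R.T + t

# Toy example: 2x2 depthmap, identity intrinsics, camera one unit along world z.
K = np.eye(3)
depth = np.full((2, 2), 2.0)
pts_cam = unproject(depth, K)                # egocentric factors: depth + intrinsics
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])  # allocentric factor: extrinsics
pts_world = to_world(pts_cam, R, t)
flow = np.array([0.1, 0.0, 0.0])             # allocentric factor: scene flow
pts_next = pts_world + flow                  # 4D prediction: geometry + motion
```

Keeping depth in camera coordinates and flow in world coordinates is what lets heterogeneous inputs (RGB-D, IMU egomotion, Radar) each update only the factor they actually observe.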
Aerial manipulators undergo rapid, configuration-dependent changes in inertial coupling forces and aerodynamic forces, making accurate dynamics modeling a core challenge for reliable control. Analytical models lose fidelity under these nonlinear and nonstationary effects, while standard data-driven methods such as deep neural networks and Gaussian processes cannot represent the diverse residual behaviors that arise across different operating conditions. We propose a regime-conditioned diffusion framework that models the full distribution of residual forces using a conditional diffusion process and a lightweight temporal encoder. The encoder extracts a compact summary of recent motion and configuration, enabling consistent residual predictions even through abrupt transitions or unseen payloads. When combined with an adaptive controller, the framework enables dynamics uncertainty compensation and yields markedly improved tracking accuracy in real-world tests.
https://arxiv.org/abs/2512.10773
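The role of the lightweight temporal encoder can be illustrated simply: it compresses a window of recent states into a compact regime summary that conditions the residual-force predictor, so predictions stay consistent through abrupt regime changes. The sketch below substitutes a linear map for the paper's conditional diffusion model; all names, dimensions, and weights are hypothetical stand-ins.

```python
import numpy as np

def temporal_encoder(window):
    """Compress a window of recent states (T x D) into a compact regime summary:
    mean configuration plus average recent rate of change."""
    mean = window.mean(axis=0)
    delta = np.diff(window, axis=0).mean(axis=0)
    return np.concatenate([mean, delta])

def residual_force(state, regime, W_s, W_r):
    """Toy regime-conditioned residual predictor (a linear stand-in for the
    conditional diffusion model described in the abstract)."""
    return W_s @ state + W_r @ regime

rng = np.random.default_rng(0)
window = rng.normal(size=(8, 4))     # last 8 states, 4-dim each
regime = temporal_encoder(window)    # 8-dim conditioning vector
W_s = rng.normal(size=(3, 4))
W_r = rng.normal(size=(3, 8))
f_hat = residual_force(window[-1], regime, W_s, W_r)  # predicted 3D residual force
```

In the full framework this residual estimate would feed the adaptive controller as a dynamics-uncertainty compensation term.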
Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
https://arxiv.org/abs/2512.10628
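The hybrid keyframe/Kalman scheme admits a compact sketch: a constant-velocity Kalman filter propagates each point's position and uncertainty on every frame, and the expensive deep tracker corrects it only at sparse keyframes. This is a minimal single-point numpy sketch with synthetic observations; the stride, noise covariances, and the stand-in tracker output are illustrative assumptions, not K-Track's actual settings.

```python
import numpy as np

# Constant-velocity Kalman filter over point state [x, y, vx, vy].
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # we only observe position
Q, R_obs = np.eye(4) * 1e-3, np.eye(2) * 1e-2

def predict(x, P):
    return F @ x, F @ P @ F.T + Q           # propagate mean and uncertainty

def update(x, P, z):
    S = H @ P @ H.T + R_obs
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    return x + K @ (z - H @ x), (np.eye(4) - K @ H) @ P

# Hybrid loop: the deep tracker runs every `stride` frames; Kalman fills in between.
stride = 4
x = np.array([0.0, 0.0, 1.0, 0.5])          # initial position + velocity
P = np.eye(4)
track = []
for frame in range(12):
    x, P = predict(x, P)                    # cheap per-frame prediction
    if frame % stride == 0:                 # sparse deep keyframe update
        z = np.array([frame + 1.0, 0.5 * (frame + 1)])  # stand-in tracker output
        x, P = update(x, P, z)
    track.append(x[:2].copy())
```

Between keyframes only the predict step runs, which is where the 5-10X speedup comes from; the covariance `P` grows until the next keyframe correction, giving the principled uncertainty propagation the abstract mentions.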
We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
https://arxiv.org/abs/2512.10617
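The dual-supervision objective can be sketched as two cosine-alignment terms that pull a trajectory embedding toward frozen CLIP targets: one for the textual motion description, one for the rendered trajectory visualization. A toy numpy sketch with random stand-ins for the CLIP embeddings and a linear projection in place of the transformer auto-encoder; none of these names reflect Lang2Motion's actual implementation.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def traj_encode(traj, W):
    """Toy trajectory encoder: flatten an N x 2 point track and project it into
    the joint embedding space (stand-in for the transformer auto-encoder)."""
    return W @ traj.reshape(-1)

def dual_alignment_loss(z_traj, z_text, z_img):
    """Sum of two cosine-distance terms against the frozen CLIP targets."""
    z = normalize(z_traj)
    return (1 - z @ normalize(z_text)) + (1 - z @ normalize(z_img))

rng = np.random.default_rng(0)
traj = rng.normal(size=(16, 2))          # 16-step 2D point trajectory
W = rng.normal(size=(512, 32)) * 0.1     # projection into CLIP's 512-dim space
z_text = rng.normal(size=512)            # stand-in for frozen CLIP text embedding
z_img = rng.normal(size=512)             # stand-in for frozen CLIP image embedding
z_traj = traj_encode(traj, W)
loss = dual_alignment_loss(z_traj, z_text, z_img)
```

Because both targets live in CLIP's joint space, minimizing this loss is what makes text-to-trajectory retrieval and CLIP-aligned latent editing possible downstream.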