Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until they meet user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: this https URL.
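The generate-critique-refine loop can be sketched as follows. All logic here is a stub invented for illustration (a scalar "quality" score and threshold critics); the actual system couples learned generators with semantic, visual, and physical-stability critics and a physics simulator.

```python
import random

# Schematic generator-critic refinement loop in the spirit of SAGE.
# Hypothetical stub logic: quality is a random scalar that improves as
# critic feedback accumulates; real critics would inspect the scene itself.
random.seed(0)

def generate_scene(task, feedback):
    # Stand-in generator: each round of feedback nudges quality upward.
    return {"task": task, "quality": random.random() + 0.15 * len(feedback)}

def critics(scene):
    # Stand-ins for the semantic / visual / physical-stability critics.
    return {
        "semantic": scene["quality"] > 0.6,
        "visual": scene["quality"] > 0.7,
        "stable": scene["quality"] > 0.8,
    }

task = "pick up a bowl and place it on the table"
feedback, scene = [], None
for attempt in range(1, 21):
    scene = generate_scene(task, feedback)
    verdict = critics(scene)
    if all(verdict.values()):
        print(f"accepted after {attempt} attempt(s)")
        break
    feedback.append([k for k, ok in verdict.items() if not ok])
else:
    print("gave up")
```

Because rejected attempts feed their failing critics back to the generator, the loop is guaranteed to terminate here; the real system's convergence instead depends on the agent's reasoning.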
https://arxiv.org/abs/2602.10116
Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS's performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
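To make the optimization concrete, here is a toy version of the rotation-synchronization objective. This is our own illustration, not the IQARS implementation: it uses planar SO(2) rotations instead of SO(3), and a greedy classical search over binarized ±δ angle increments as a stand-in for the quantum annealer.

```python
import math, random

# Toy rotation synchronization: recover absolute angles theta_i from noisy
# relative measurements m_ij ~ theta_i - theta_j. Planar stand-in for the
# SO(3) problem; the "1-bit" +/-delta proposals mimic binarized sub-problems.
random.seed(0)
n = 6
truth = [0.0] + [random.uniform(-1.0, 1.0) for _ in range(n - 1)]

edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
meas = {(i, j): truth[i] - truth[j] + random.gauss(0, 0.05) for i, j in edges}

def wrap(a):
    # Wrap an angle to (-pi, pi].
    return (a + math.pi) % (2 * math.pi) - math.pi

def cost(theta):
    # Sum of squared wrapped residuals over all measured edges.
    return sum(wrap(theta[i] - theta[j] - m) ** 2 for (i, j), m in meas.items())

theta = [0.0] * n
delta = 0.5
for _ in range(60):
    for i in range(1, n):                 # node 0 stays fixed (gauge freedom)
        for s in (+delta, -delta):
            cand = theta[:i] + [theta[i] + s] + theta[i + 1:]
            if cost(cand) < cost(theta):
                theta = cand
    delta *= 0.9                          # shrink the binarized step size

err = max(abs(wrap(theta[i] - truth[i])) for i in range(n))
print(f"max angular error: {err:.3f} rad")
```

The shrinking-step sweeps play the role of the iterated local sub-problems; on an annealer each sweep's binary proposals would be solved jointly rather than greedily.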
https://arxiv.org/abs/2602.10115
Recent advances in sampling-based motion planning algorithms for high DOF arms leverage GPUs to provide SOTA performance. These algorithms can be used to control multiple arms jointly, but this approach scales poorly. To address this, we extend STORM, a sampling-based model-predictive-control (MPC) motion planning algorithm, to handle multiple robots in a distributed fashion. First, we modify STORM to handle dynamic obstacles. Then, we let each arm compute its own motion plan prefix, which it shares with the other arms, which treat it as a dynamic obstacle. Finally, we add a dynamic priority scheme. The new algorithm, MR-STORM, demonstrates clear empirical advantages over SOTA algorithms when operating with both static and dynamic obstacles.
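The plan-prefix-sharing scheme can be sketched with two point agents in 2D. Everything below (constants, the sampling distribution, the priority rule) is invented for illustration; it mirrors only the structure of MR-STORM, not its actual cost terms or robot dynamics.

```python
import random

# Two agents swap sides; each samples short candidate plans, shares its chosen
# prefix, and scores its own samples against the other's prefix treated as a
# time-indexed dynamic obstacle. Priority: the agent farther from its goal
# plans first. All parameters are hypothetical.
random.seed(1)
H, STEP, SAFE, K = 5, 0.25, 0.4, 32   # horizon, step size, clearance, samples

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def rollout(pos, goal):
    # One candidate plan: H noisy steps biased toward the goal.
    plan, p = [], pos
    for _ in range(H):
        d = max(dist(p, goal), 1e-9)
        p = (p[0] + STEP * (goal[0] - p[0]) / d + random.gauss(0, 0.08),
             p[1] + STEP * (goal[1] - p[1]) / d + random.gauss(0, 0.08))
        plan.append(p)
    return plan

def score(plan, goal, obstacle):
    c = dist(plan[-1], goal)
    for t in range(min(len(plan), len(obstacle))):
        gap = dist(plan[t], obstacle[t])
        if gap < SAFE:
            c += 10.0 * (SAFE - gap)   # penalize time-indexed near-collisions
    return c

agents = [{"pos": (0.0, 0.0), "goal": (3.0, 0.0), "prefix": []},
          {"pos": (3.0, 0.3), "goal": (0.0, 0.3), "prefix": []}]

min_gap = float("inf")
for _ in range(40):
    # Dynamic priority: the agent farther from its goal plans first.
    order = sorted((0, 1), key=lambda i: -dist(agents[i]["pos"], agents[i]["goal"]))
    for i in order:
        other = agents[1 - i]
        cands = [rollout(agents[i]["pos"], agents[i]["goal"]) for _ in range(K)]
        best = min(cands, key=lambda pl: score(pl, agents[i]["goal"], other["prefix"]))
        agents[i]["prefix"] = best          # share the chosen plan prefix
        agents[i]["pos"] = best[0]          # execute its first step (MPC style)
    min_gap = min(min_gap, dist(agents[0]["pos"], agents[1]["pos"]))

print(f"closest approach: {min_gap:.2f}")
```

Each agent replans every step, so the shared prefixes stay fresh, which is what lets the scheme run in a distributed fashion.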
https://arxiv.org/abs/2602.10114
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms competing methods on multiple metrics, with its best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at this https URL.
https://arxiv.org/abs/2602.10113
Learning-based controllers have achieved impressive performance in agile quadrotor flight but typically rely on massive training in simulation, necessitating accurate system identification for effective Sim2Real transfer. However, even with precise modeling, fixed policies remain susceptible to out-of-distribution scenarios, ranging from external aerodynamic disturbances to internal hardware degradation. To ensure safety under these evolving uncertainties, such controllers are forced to operate with conservative safety margins, inherently constraining their agility outside of controlled settings. While online adaptation offers a potential remedy, safely exploring physical limits remains a critical bottleneck due to data scarcity and safety risks. To bridge this gap, we propose a self-adaptive framework that eliminates the need for precise system identification or offline Sim2Real transfer. We introduce Adaptive Temporal Scaling (ATS) to actively explore platform physical limits, and employ online residual learning to augment a simple nominal model. Based on the learned hybrid model, we further propose Real-world Anchored Short-horizon Backpropagation Through Time (RASH-BPTT) to achieve efficient and robust in-flight policy updates. Extensive experiments demonstrate that our quadrotor reliably executes agile maneuvers near actuator saturation limits. The system evolves a conservative base policy from a peak speed of 1.9 m/s to 7.3 m/s within approximately 100 seconds of flight time. These findings underscore that real-world adaptation serves not merely to compensate for modeling errors, but as a practical mechanism for sustained performance improvement in aggressive flight regimes.
https://arxiv.org/abs/2602.10111
Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatial Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 -> 84.6 on Google Robot and from 54.7 -> 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data and models are released at this https URL
https://arxiv.org/abs/2602.10109
Human demonstrations offer rich environmental diversity and scale naturally, making them an appealing alternative to robot teleoperation. While this paradigm has advanced robot-arm manipulation, its potential for the more challenging, data-hungry problem of humanoid loco-manipulation remains largely unexplored. We present EgoHumanoid, the first framework to co-train a vision-language-action policy using abundant egocentric human demonstrations together with a limited amount of robot data, enabling humanoids to perform loco-manipulation across diverse real-world environments. To bridge the embodiment gap between humans and robots, including discrepancies in physical morphology and viewpoint, we introduce a systematic alignment pipeline spanning from hardware design to data processing. A portable system for scalable human data collection is developed, and we establish practical collection protocols to improve transferability. At the core of our human-to-humanoid alignment pipeline lie two key components. The view alignment reduces visual domain discrepancies caused by camera height and perspective variation. The action alignment maps human motions into a unified, kinematically feasible action space for humanoid control. Extensive real-world experiments demonstrate that incorporating robot-free egocentric data significantly outperforms robot-only baselines by 51%, particularly in unseen environments. Our analysis further reveals which behaviors transfer effectively and the potential for scaling human data.
https://arxiv.org/abs/2602.10106
Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
https://arxiv.org/abs/2602.10105
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
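Reading the abstract literally, the objective anchors the clip-integrated latent action to the frozen encoder's temporal feature difference. A minimal numpy sketch of such a loss follows; this is our own reconstruction, with a random linear-tanh map standing in for the frozen self-supervised video encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_FEAT, T = 8, 4, 6

W_frozen = rng.normal(size=(D_FEAT, D_OBS))   # frozen video-encoder stand-in

def encode(x):
    return np.tanh(W_frozen @ x)              # never updated during training

def seq_delta_loss(latent_actions, frames):
    """1 - cos(sum_t a_t, f(x_T) - f(x_0)): align the integrated latent
    action with the encoder's feature difference over the whole clip."""
    integrated = latent_actions.sum(axis=0)
    delta = encode(frames[-1]) - encode(frames[0])
    denom = np.linalg.norm(integrated) * np.linalg.norm(delta) + 1e-9
    return 1.0 - integrated @ delta / denom

frames = rng.normal(size=(T + 1, D_OBS))
delta = encode(frames[-1]) - encode(frames[0])

# Latents whose sum matches the observed control effect give a small loss.
aligned = np.tile(delta / T, (T, 1)) + 0.02 * rng.normal(size=(T, D_FEAT))
# Clip-local latents with no shared reference typically give a large loss.
arbitrary = rng.normal(size=(T, D_FEAT))

print(f"aligned loss:   {seq_delta_loss(aligned, frames):.3f}")
print(f"arbitrary loss: {seq_delta_loss(arbitrary, frames):.3f}")
```

The encoder feature difference acts as the "shared coordinate system" the abstract describes: any clip, from any context, maps its control effect into the same frozen feature space.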
https://arxiv.org/abs/2602.10104
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
https://arxiv.org/abs/2602.10102
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
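Robo3R learns its global similarity transformation; the classical closed-form counterpart for point sets is the Umeyama alignment. It is shown below as a reference point (standard algorithm, not the paper's learned module): it recovers the scale, rotation, and translation that carry scale-ambiguous local points into a target frame.

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity transform (s, R, t) with dst ≈ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0] * (src.shape[1] - 1) + [d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))                  # scale-ambiguous "local" points

# Hypothetical ground truth: scale 2, 30-degree rotation about z, a shift.
ang = np.pi / 6
R_true = np.array([[np.cos(ang), -np.sin(ang), 0.0],
                   [np.sin(ang),  np.cos(ang), 0.0],
                   [0.0, 0.0, 1.0]])
robot_frame = 2.0 * pts @ R_true.T + np.array([0.5, -1.0, 0.3])

s, R, t = umeyama(pts, robot_frame)
aligned = s * pts @ R.T + t
print(f"scale: {s:.3f}, max residual: {np.abs(aligned - robot_frame).max():.2e}")
```

In the noiseless case the recovery is exact; Robo3R's keypoint-based PnP refinement addresses the harder, noisy version of the same alignment problem.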
https://arxiv.org/abs/2602.10101
Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. In particular, RJF enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
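The Geometric Interference argument is easy to verify numerically: linear (Euclidean) interpolation between unit-norm features leaves the hypersphere and passes through its low-norm interior, while geodesic interpolation (slerp) stays on the manifold. A small check, with random unit vectors standing in for encoder features:

```python
import numpy as np

def slerp(a, b, t):
    """Geodesic interpolation between unit vectors a and b on the sphere."""
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(0)
D = 256
a = rng.normal(size=D); a /= np.linalg.norm(a)
b = rng.normal(size=D); b /= np.linalg.norm(b)   # near-orthogonal in high dim

ts = np.linspace(0.0, 1.0, 11)
lin_norms = [np.linalg.norm((1 - t) * a + t * b) for t in ts]
geo_norms = [np.linalg.norm(slerp(a, b, t)) for t in ts]

print(f"linear path min norm:   {min(lin_norms):.3f}")   # dips into the interior
print(f"geodesic path min norm: {min(geo_norms):.3f}")   # stays on the sphere
```

For near-orthogonal endpoints the linear path's midpoint has norm close to 1/√2, exactly the off-manifold excursion the paper's geodesic constraint removes; the Jacobi regularization itself is beyond this sketch.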
https://arxiv.org/abs/2602.10099
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
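The leakage-free recipe reduces to one rule: future frames feed the target encoder only, never the student. A linear toy version follows (our sketch: invented dynamics and sizes, plain gradient descent instead of the paper's training stack, and a fixed rather than EMA-updated target encoder).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6
A = 0.9 * np.eye(D) + 0.1 * rng.normal(size=(D, D)) / np.sqrt(D)  # toy dynamics
W_tgt = rng.normal(size=(4, D)) / np.sqrt(D)     # target encoder (kept frozen)

n = 2000
X = rng.normal(size=(n, D))                      # current observations x_t
X_next = X @ A.T + 0.01 * rng.normal(size=(n, D))  # future observations x_{t+1}

targets = X_next @ W_tgt.T   # future frames used ONLY as supervision targets

M = np.zeros((4, D))         # student: predicts future latents from x_t alone

def mse(M):
    return ((X @ M.T - targets) ** 2).mean()

before = mse(M)
for _ in range(300):
    residual = X @ M.T - targets        # student input: current frames only
    M -= 0.05 * 2 * residual.T @ X / n  # gradient step on the latent-space MSE
after = mse(M)
print(f"latent prediction MSE: {before:.4f} -> {after:.4f}")
```

Because the loss lives in the target encoder's latent space, pixel-level nuisance variation never enters the objective, which is the robustness mechanism the abstract describes.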
https://arxiv.org/abs/2602.10098
Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
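The claimed throughput benefit follows from simple accounting: temporal attention moves out of the denoising loop. A back-of-envelope cost model with made-up sizes (all four constants below are hypothetical, chosen only to show the shape of the saving):

```python
# Hypothetical cost accounting for the SCD decoupling: a standard causal
# video diffuser runs causal attention over the full context at EVERY
# denoising step, while SCD runs temporal reasoning once per frame and keeps
# only a lightweight per-frame decoder inside the denoising loop.
F = 32          # frames in the context window (made up)
S = 50          # denoising steps per frame (made up)
C_ATTN = 1.0    # relative cost of one causal-attention pass over the context
C_DEC = 0.1    # relative cost of one pass of the lightweight frame decoder

baseline = F * S * C_ATTN             # attention inside every denoise step
scd = F * (C_ATTN + S * C_DEC)        # attention once per frame + cheap decoder

print(f"baseline: {baseline:.0f}, SCD: {scd:.0f}, "
      f"speedup: {baseline / scd:.1f}x")
```

The ratio depends entirely on how cheap the decoder is relative to full causal attention; the paper's measured speedups are what ground this, not the toy constants here.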
https://arxiv.org/abs/2602.10095
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
https://arxiv.org/abs/2602.10094
Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies. However, visuo-tactile perception is critical for contact-rich manipulation, as tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale and reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable and controllable generation of informative contact interactions. Based on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with designed supervisory signals, providing tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, which consists of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at this https URL.
https://arxiv.org/abs/2602.10093
Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, models' understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
https://arxiv.org/abs/2602.10092
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at this https URL.
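The "code-driven and backed by databases" design can be illustrated in a few lines: tools are ordinary functions over a SQLite state, so state transitions are deterministic and the reward is computed directly from the ground-truth database rather than judged by an LLM. The schema, tools, and task below are hypothetical, not from AWM.

```python
import sqlite3

# Minimal code-driven, database-backed tool environment (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, status TEXT)")
db.execute("INSERT INTO orders (item, status) VALUES ('lamp', 'pending')")
db.commit()

def tool_list_orders():
    # Observation tool: read-only view of the environment state.
    return db.execute("SELECT id, item, status FROM orders").fetchall()

def tool_cancel_order(order_id):
    # Action tool: a deterministic state transition over the database.
    cur = db.execute(
        "UPDATE orders SET status = 'cancelled' "
        "WHERE id = ? AND status = 'pending'", (order_id,))
    db.commit()
    return {"ok": cur.rowcount == 1}

def reward(goal_order_id):
    # Reliable reward: read ground truth straight from the database.
    (status,) = db.execute(
        "SELECT status FROM orders WHERE id = ?", (goal_order_id,)).fetchone()
    return 1.0 if status == "cancelled" else 0.0

# A two-turn "agent trajectory": observe, then act with the cancel tool.
obs = tool_list_orders()
result = tool_cancel_order(obs[0][0])
print(obs, result, reward(1))
```

Because the full state is an inspectable database, the same mechanism that makes transitions consistent also makes reward functions cheap to verify, which is what enables reliable large-scale RL in these synthetic environments.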
https://arxiv.org/abs/2602.10090
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: this https URL.
https://arxiv.org/abs/2602.10081
We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.
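The duplication cue that separates source-target localization from artifact-only detection can be shown on toy data: in a copy-move forgery the source and target patches are near-identical, so a normalized patch-similarity matrix exposes the pair. This is our own illustrative sketch (random descriptors, index-based source/target tie-breaking), not Forensim's attention-based model.

```python
import numpy as np

rng = np.random.default_rng(0)
P, DESC = 12, 16                        # patches per image, descriptor size
patches = rng.normal(size=(P, DESC))
patches[9] = patches[2] + 0.01 * rng.normal(size=DESC)  # patch 2 copied to 9

# Normalized similarity matrix (analogous to a normalized attention map).
feat = patches / np.linalg.norm(patches, axis=1, keepdims=True)
sim = feat @ feat.T
np.fill_diagonal(sim, -1.0)             # ignore self-similarity

src, tgt = np.unravel_index(sim.argmax(), sim.shape)
mask = np.zeros(P, dtype=int)           # 0 = pristine
mask[min(src, tgt)] = 1                 # 1 = source (tie-broken by index here)
mask[max(src, tgt)] = 2                 # 2 = target

print(f"duplicated pair: {src}, {tgt}")
print("three-class mask:", mask.tolist())
```

Deciding which member of the pair is the source versus the pasted target genuinely requires context, which is why the real system pairs the similarity cue with a region-based block attention module rather than an index rule.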
https://arxiv.org/abs/2602.10079