Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training in simulation offers a cost-effective alternative, the visual domain gap between simulation and the robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised robot handover policies solely from RGB images, without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), leverages sparse-view Gaussian Splatting reconstruction of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and directly deploy this policy in the real environment. Experiments in both Gaussian-Splatting-reconstructed scenes and real-world human-to-robot handovers demonstrate that H2RH-SGS provides a new and effective representation for the human-to-robot handover task.
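The core mechanism here is that a simulated camera pose change in the reconstructed scene maps directly to a gripper pose change for the gripper-mounted camera. A minimal sketch of that mapping, assuming a fixed hand-eye extrinsic and hypothetical variable names (this is an illustration, not code from the paper):

```python
# Hedged sketch: converting a simulated camera pose change into a gripper pose
# change for an eye-in-hand setup. The transforms below are assumptions used
# only to illustrate the frame change.
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def gripper_update(T_world_gripper, T_cam_delta, T_gripper_cam):
    """Map a camera-frame pose change to a new gripper pose.

    T_world_gripper : current gripper pose in the world frame
    T_cam_delta     : relative camera motion rendered in the reconstructed scene
    T_gripper_cam   : fixed hand-eye extrinsic (camera expressed in gripper frame)
    """
    # Express the camera motion in the gripper frame, then compose it with the
    # current gripper pose.
    T_gripper_delta = T_gripper_cam @ T_cam_delta @ np.linalg.inv(T_gripper_cam)
    return T_world_gripper @ T_gripper_delta

# Toy usage: a 2 cm camera translation along its optical axis.
T_now = se3(np.eye(3), np.array([0.4, 0.0, 0.3]))
T_delta = se3(np.eye(3), np.array([0.0, 0.0, 0.02]))
T_handeye = se3(np.eye(3), np.array([0.0, 0.0, 0.05]))
print(gripper_update(T_now, T_delta, T_handeye))
```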
https://arxiv.org/abs/2507.08726
Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches assume that expert demonstrations are available, which is often not the case. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which extends the state of the art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation, and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state of the art in reward learning from suboptimal demonstrations.
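SPLASH learns rewards from preferences over suboptimal demonstrations. As a point of reference only, here is a sketch of the standard Bradley-Terry preference loss that preference-based reward learning typically builds on; the reward network and segment format are assumptions, not SPLASH's actual objective or architecture:

```python
# Hedged sketch of a generic Bradley-Terry preference loss for reward learning.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_return(self, segment):            # segment: (T, obs_dim)
        return self.net(segment).sum()            # sum of per-step rewards

def preference_loss(reward_net, seg_a, seg_b, pref_a):
    """pref_a = 1.0 if segment A is preferred, 0.0 if segment B is preferred."""
    ra = reward_net.segment_return(seg_a)
    rb = reward_net.segment_return(seg_b)
    # Probability that A is preferred under the Bradley-Terry model.
    p_a = torch.sigmoid(ra - rb)
    return nn.functional.binary_cross_entropy(p_a, torch.tensor(pref_a))
```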
https://arxiv.org/abs/2507.08707
Learning whole-body control for locomotion and arm motions in a single policy is challenging because the two tasks have conflicting goals. For instance, efficient locomotion typically favors a horizontal base orientation, while end-effector tracking may benefit from base tilting to extend reachability. Additionally, current Reinforcement Learning (RL) approaches that use a pose-based task specification lack the ability to directly control the end-effector velocity, making it very challenging to execute trajectories smoothly. To address these limitations, we propose an RL-based framework that allows for dynamic, velocity-aware whole-body end-effector control. Our method introduces a multi-critic actor architecture that decouples the reward signals for locomotion and manipulation, simplifying reward tuning and allowing the policy to resolve task conflicts more effectively. Furthermore, we design a twist-based end-effector task formulation that can track both discrete poses and motion trajectories. We validate our approach through a set of simulation and hardware experiments using a quadruped robot equipped with a robotic arm. The resulting controller can simultaneously walk and move its end-effector and shows emergent whole-body behaviors, where the base assists the arm in extending the workspace, despite the lack of an explicit formulation.
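A minimal sketch of the multi-critic actor idea described above: separate value heads for the locomotion and manipulation reward streams, with their advantages combined for a single actor update. Layer sizes, activations, and the equal weighting are assumptions rather than the paper's exact design:

```python
# Hedged sketch of a multi-critic actor with decoupled reward streams.
import torch
import torch.nn as nn

class MultiCriticActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, act_dim))
        # One critic per reward group keeps the two signals decoupled.
        self.critic_loco = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))
        self.critic_manip = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.actor(obs), self.critic_loco(obs), self.critic_manip(obs)

def combined_advantage(ret_loco, ret_manip, v_loco, v_manip,
                       w_loco=0.5, w_manip=0.5):
    """Normalize each group's advantage separately, then mix for the actor."""
    adv_loco = ret_loco - v_loco
    adv_manip = ret_manip - v_manip
    norm = lambda a: (a - a.mean()) / (a.std() + 1e-8)
    return w_loco * norm(adv_loco) + w_manip * norm(adv_manip)
```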
https://arxiv.org/abs/2507.08656
When inverse kinematics (IK) is adopted to control robotic arms in manipulation tasks, there is often a discrepancy between the end effector (EE) position of the robot model in the simulator and the physical EE in reality. In most robotic scenarios with sim-to-real transfer, we have information about joint positions in both simulation and reality, but the EE position is only available in simulation. We developed a novel method to overcome this difficulty based on haptic feedback calibration, using a touchscreen in front of the robot that provides information on the EE position in the real environment. During the calibration procedure, the robot touches specific points on the screen, and the information is stored. In the next stage, we build a transformation function from these data, based on linear transformations and neural networks, that can output all missing variables from any partial input (simulated/real joint/EE positions). Our results demonstrate that a fully nonlinear neural network model performs best, significantly reducing positioning errors.
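A minimal sketch of the calibration regressor idea, assuming a mask-and-concatenate input scheme and hypothetical dimensions: the network receives whichever variables are observed plus a mask of which entries are present, and predicts the full variable vector. This illustrates the stated input/output behaviour, not the paper's model:

```python
# Hedged sketch of a partial-input -> full-output calibration network.
import torch
import torch.nn as nn

class CalibrationNet(nn.Module):
    def __init__(self, n_vars, hidden=128):
        super().__init__()
        # Input: observed values (zeros where missing) concatenated with the mask.
        self.net = nn.Sequential(
            nn.Linear(2 * n_vars, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_vars),
        )

    def forward(self, values, mask):
        x = torch.cat([values * mask, mask], dim=-1)
        return self.net(x)

# Training targets are the complete vectors (sim joints, real joints, sim EE,
# real EE) recorded during the touchscreen calibration; at test time the mask
# marks which entries are actually available.
net = CalibrationNet(n_vars=20)
loss_fn = nn.MSELoss()
```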
https://arxiv.org/abs/2507.08572
3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an unconstrained set of images in a single forward pass. This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology's broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.
https://arxiv.org/abs/2507.08448
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robots remains challenging, even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects.
https://arxiv.org/abs/2507.08416
Humanoid robots show significant potential in daily tasks. However, reinforcement learning-based motion policies often suffer from robustness degradation due to the sim-to-real dynamics gap, which limits the agility of real robots. In this work, we propose a novel robust adversarial training paradigm designed to enhance the robustness of humanoid motion policies in the real world. The paradigm introduces a learnable adversarial attack network that precisely identifies vulnerabilities in motion policies and applies targeted perturbations, forcing the motion policy to enhance its robustness against perturbations through dynamic adversarial training. We conduct experiments on the Unitree G1 humanoid robot for both perceptive locomotion and whole-body control tasks. The results demonstrate that our proposed method significantly enhances the robot's motion robustness in real-world environments, enabling successful traversal of challenging terrains and highly agile whole-body trajectory tracking.
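The alternating structure of such dynamic adversarial training can be sketched as follows. This is a heavily simplified illustration that treats the reward as a differentiable proxy; in practice the motion policy would be updated with an RL algorithm such as PPO, and the attack space and bounds below are assumptions:

```python
# Hedged sketch of alternating attacker/policy updates with bounded perturbations.
import torch
import torch.nn as nn

class AttackNet(nn.Module):
    def __init__(self, obs_dim, pert_dim, max_pert=0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, pert_dim), nn.Tanh())
        self.max_pert = max_pert

    def forward(self, obs):
        return self.max_pert * self.net(obs)      # bounded perturbation

def adversarial_step(policy, attacker, reward_fn, obs, policy_opt, attack_opt):
    # 1) Attacker update: find the perturbation that most degrades the policy.
    pert = attacker(obs)
    attack_loss = reward_fn(policy, obs + pert).mean()   # attacker wants this low
    attack_opt.zero_grad(); attack_loss.backward(); attack_opt.step()

    # 2) Policy update under the (now fixed) perturbation it just received.
    with torch.no_grad():
        pert = attacker(obs)
    policy_loss = -reward_fn(policy, obs + pert).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```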
https://arxiv.org/abs/2507.08303
Building a robust perception module is crucial for visuomotor policy learning. While recent methods incorporate pre-trained 2D foundation models into robotic perception modules to leverage their strong semantic understanding, they struggle to capture 3D spatial information and generalize across diverse camera viewpoints. These limitations hinder the policy's effectiveness, especially in fine-grained robotic manipulation scenarios. To address these challenges, we propose CL3R, a novel 3D pre-training framework designed to enhance robotic manipulation policies. Our method integrates both spatial awareness and semantic understanding by employing a point cloud Masked Autoencoder to learn rich 3D representations while leveraging pre-trained 2D foundation models through contrastive learning for efficient semantic knowledge transfer. Additionally, we propose a 3D visual representation pre-training framework for robotic tasks. By unifying coordinate systems across datasets and introducing random fusion of multi-view point clouds, we mitigate camera view ambiguity and improve generalization, enabling robust perception from novel viewpoints at test time. Extensive experiments in both simulation and the real world demonstrate the superiority of our method, highlighting its effectiveness in visuomotor policy learning for robotic manipulation.
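The contrastive transfer from pre-trained 2D foundation-model features can be illustrated with a standard InfoNCE-style alignment loss between paired point-cloud and image features; the symmetric form and temperature below are assumptions, not CL3R's exact formulation:

```python
# Hedged sketch of an InfoNCE alignment between 3D and 2D features.
import torch
import torch.nn.functional as F

def info_nce(pc_feats, img_feats, temperature=0.07):
    """pc_feats, img_feats: (B, D) paired features; row i of each is a positive pair."""
    pc = F.normalize(pc_feats, dim=-1)
    img = F.normalize(img_feats, dim=-1)
    logits = pc @ img.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(pc.size(0), device=pc.device)
    # Symmetric contrastive loss: point -> image and image -> point.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```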
https://arxiv.org/abs/2507.08262
Large language models (LLMs) have shown promise in robotic procedural planning, yet their human-centric reasoning often omits the low-level, grounded details needed for robotic execution. Vision-language models (VLMs) offer a path toward more perceptually grounded plans, but current methods either rely on expensive, large-scale models or are constrained to narrow simulation settings. We introduce SelfReVision, a lightweight and scalable self-improvement framework for vision-language procedural planning. SelfReVision enables small VLMs to iteratively critique, revise, and verify their own plans, without external supervision or teacher models, drawing inspiration from chain-of-thought prompting and self-instruct paradigms. Through this self-distillation loop, models generate higher-quality, execution-ready plans that can be used both at inference and for continued fine-tuning. Using models ranging from 3B to 72B parameters, our results show that SelfReVision not only boosts performance over weak base VLMs but also outperforms models 100X the size, yielding improved control in downstream embodied tasks.
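A minimal sketch of the critique-revise-verify loop, where `vlm` stands for any callable that sends a prompt (and image) to a small vision-language model and returns text; the prompt wording, stopping rule, and verification step are assumptions rather than the paper's exact procedure:

```python
# Hedged sketch of a self-revision loop for plan generation.
def self_revision(vlm, image, task, max_rounds=3):
    plan = vlm(image, f"Write a step-by-step plan to: {task}")
    for _ in range(max_rounds):
        critique = vlm(image, f"Plan:\n{plan}\nList concrete flaws that would "
                              f"prevent a robot from executing this plan.")
        revised = vlm(image, f"Plan:\n{plan}\nCritique:\n{critique}\n"
                             f"Rewrite the plan to fix every flaw.")
        verdict = vlm(image, f"Original plan:\n{plan}\nRevised plan:\n{revised}\n"
                             f"Answer REVISED or ORIGINAL: which is more executable?")
        if "REVISED" in verdict.upper():
            plan = revised          # keep the improvement and iterate again
        else:
            break                   # no further gain; stop early
    return plan
```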
https://arxiv.org/abs/2507.08224
Obstacle avoidance is crucial for mobile robots' navigation in both known and unknown environments. This research designs, trains, and tests two custom Convolutional Neural Networks (CNNs), using color and depth images from a depth camera as inputs. Both networks adopt sensor fusion to produce an output: the mobile robot's angular velocity, which serves as the robot's steering command. A new visual navigation dataset was collected in diverse environments with varying lighting conditions and dynamic obstacles. During data collection, a communication link was established over Wi-Fi between a remote server and the robot, using Robot Operating System (ROS) topics. Velocity commands were transmitted from the server to the robot, enabling synchronized recording of visual data and the corresponding steering commands. Evaluation metrics such as Mean Squared Error, Variance Score, and feed-forward time provided a clear comparison between the two networks and clarified which one is better suited to the application.
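A minimal sketch of a late-fusion network of the kind described: two convolutional branches for the color and depth images whose features are concatenated and regressed to a single angular velocity. The layer sizes and the late-fusion choice are assumptions; the paper's two custom architectures are not reproduced here:

```python
# Hedged sketch of RGB + depth late fusion regressing a steering command.
import torch
import torch.nn as nn

def conv_branch(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 5, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class FusionSteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_branch = conv_branch(3)
        self.depth_branch = conv_branch(1)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_branch(rgb), self.depth_branch(depth)], dim=1)
        return self.head(fused)        # predicted angular velocity

# Training would use (color image, depth image, recorded angular velocity)
# tuples logged over the ROS link described above, with MSE as the loss.
```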
https://arxiv.org/abs/2507.08112
In crowded environments, individuals must navigate around other occupants to reach their destinations. Understanding and controlling traffic flows in these spaces is relevant to coordinating robot swarms and designing infrastructure for dense populations. Here, we combine simulations, theory, and robotic experiments to study how noisy motion can disrupt traffic jams and enable flow as agents travel to individual goals. Above a critical noise level, large jams do not persist. From this observation, we analytically approximate the goal attainment rate as a function of the noise level, then solve for the optimal agent density and noise level that maximize the swarm's goal attainment rate. We perform robotic experiments to corroborate our simulated and theoretical results. Finally, we compare simple, local navigation approaches with a sophisticated but computationally costly central planner. A simple reactive scheme performs well up to moderate densities and is far more computationally efficient than a planner, suggesting lessons for real-world problems.
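A toy version of the kind of simulation described, under strongly simplified assumptions, makes the measured quantity concrete: agents step toward individual goals with additive motion noise, blocked moves are cancelled, and the goal attainment rate is recorded as a function of the noise level:

```python
# Hedged toy simulation: goal attainment rate vs. motion noise for goal-directed agents.
import numpy as np

def attainment_rate(n_agents=50, noise=0.1, steps=2000, arena=10.0, radius=0.2,
                    speed=0.05, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0, arena, (n_agents, 2))
    goals = rng.uniform(0, arena, (n_agents, 2))
    reached = 0
    for _ in range(steps):
        to_goal = goals - pos
        dist = np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-9
        step = speed * to_goal / dist + noise * rng.normal(size=pos.shape)
        # Crude exclusion: cancel moves that would end inside another agent.
        new_pos = np.clip(pos + step, 0, arena)
        d = np.linalg.norm(new_pos[:, None] - new_pos[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        blocked = (d < 2 * radius).any(axis=1)
        pos = np.where(blocked[:, None], pos, new_pos)
        done = np.linalg.norm(goals - pos, axis=1) < radius
        reached += done.sum()
        goals[done] = rng.uniform(0, arena, (done.sum(), 2))   # resample new goals
    return reached / (n_agents * steps)       # goals reached per agent per step

for sigma in (0.0, 0.05, 0.1, 0.2):
    print(sigma, attainment_rate(noise=sigma))
```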
https://arxiv.org/abs/2507.08100
Robots can better interact with humans and unstructured environments through touch sensing. However, most commercial robots are not equipped with tactile skins, making it challenging to achieve even basic touch-sensing functions, such as contact localization. We present UniTac, a data-driven whole-body touch-sensing approach that uses only proprioceptive joint sensors and does not require the installation of additional sensors. Our approach enables a robot equipped solely with joint sensors to localize contacts. Our goal is to democratize touch sensing and provide an off-the-shelf tool for HRI researchers to equip their robots with touch-sensing capabilities. We validate our approach on two platforms: the Franka robot arm and the Spot quadruped. On Franka, we can localize contact to within 8.0 centimeters, and on Spot, we can localize to within 7.2 centimeters at around 2,000 Hz on an RTX 3090 GPU, without adding any additional sensors to the robot. Project website: this https URL.
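The data-driven idea can be sketched as a straightforward regression from a short window of proprioceptive joint signals to a contact point on the body; the input layout, window length, and network below are assumptions, not UniTac's published architecture:

```python
# Hedged sketch: regressing a contact location from proprioceptive joint signals.
import torch
import torch.nn as nn

class ContactLocalizer(nn.Module):
    def __init__(self, n_joints=7, window=10, hidden=256):
        super().__init__()
        in_dim = 3 * n_joints * window            # position, velocity, effort per joint per step
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                  # predicted contact point (x, y, z)
        )

    def forward(self, joint_window):               # (B, window, 3 * n_joints)
        return self.net(joint_window.flatten(1))

# Supervision would come from known touch locations collected during data gathering.
```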
https://arxiv.org/abs/2507.07980
Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images sourced from NASA's Planetary Data System (PDS) and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
https://arxiv.org/abs/2507.07978
As robotic systems increasingly integrate into daily life, from smart home assistants to the new wave of industrial automation systems (Industry 4.0), there is a growing need to bridge the gap between complex robotic systems and everyday users. The Robot Operating System (ROS) is a flexible framework often utilised in writing robot software, providing tools and libraries for building complex robotic systems. However, ROS's distributed architecture and technical messaging system create barriers to understanding robot status and diagnosing errors. This gap can lead to extended maintenance downtime, as users with limited ROS knowledge may struggle to quickly diagnose and resolve system issues. Moreover, this deficit in expertise often delays proactive maintenance and troubleshooting, further increasing the frequency and duration of system interruptions. ROS Help Desk provides intuitive error explanations and debugging support, dynamically customized to users of varying expertise levels. It features user-centric debugging tools that simplify error diagnosis, implements proactive error detection capabilities to reduce downtime, and integrates multimodal data processing for comprehensive system state understanding across multi-sensor data (e.g., lidar, RGB). Qualitative and quantitative testing with artificially induced errors demonstrates the system's ability to proactively and accurately diagnose problems, ultimately reducing maintenance time and fostering more effective human-robot collaboration.
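In ROS 1 terms, the proactive-detection hook can be as simple as monitoring the standard /diagnostics topic and surfacing ERROR-level statuses; the full system layers explanations and multimodal context on top, which this sketch omits. It assumes a ROS 1 environment with rospy:

```python
# Hedged sketch: flagging ERROR-level diagnostics for the operator in ROS 1.
import rospy
from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus

def on_diagnostics(msg):
    for status in msg.status:
        if status.level >= DiagnosticStatus.ERROR:
            # In the full system this string would be handed to the explanation
            # layer; here we simply log it.
            rospy.logwarn("Problem in %s: %s", status.name, status.message)

if __name__ == "__main__":
    rospy.init_node("help_desk_monitor")
    rospy.Subscriber("/diagnostics", DiagnosticArray, on_diagnostics)
    rospy.spin()
```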
https://arxiv.org/abs/2507.07846
Autonomous agents, particularly in the field of robotics, rely on sensory information to perceive and navigate their environment. However, these sensory inputs are often imperfect, leading to distortions in the agent's internal representation of the world. This paper investigates the nature of these perceptual distortions and how they influence autonomous representation learning using a minimal robotic system. We utilize a simulated two-wheeled robot equipped with distance sensors and a compass, operating within a simple square environment. Through analysis of the robot's sensor data during random exploration, we demonstrate how a distorted perceptual space emerges. Despite these distortions, we identify emergent structures within the perceptual space that correlate with the physical environment, revealing how the robot autonomously learns a structured representation for navigation without explicit spatial information. This work contributes to the understanding of embodied cognition, minimal agency, and the role of perception in self-generated navigation strategies in artificial life.
https://arxiv.org/abs/2507.07845
Carrying an unknown dynamic load is an important practical application for quadruped robots. The problem is non-trivial, posing three major challenges in quadruped locomotion control. First, how to model or represent the dynamics of the load in a generic manner. Second, how to make the robot capture those dynamics without any external sensing. Third, how to enable the robot to interact with the load, handling the mutual effects and stabilizing it. In this work, we propose a general load modeling approach called load characteristics modeling to capture the dynamics of the load. We integrate this modeling technique with recent advances in Reinforcement Learning (RL) based locomotion control to enable the robot to infer the dynamics of load movement and interact with the load indirectly to stabilize it, and we carry out sim-to-real deployment to verify its effectiveness in real scenarios. We conduct extensive comparative simulation experiments to validate the effectiveness and superiority of our proposed method. Results show that our method outperforms other methods in sudden load resistance, load stabilization, and locomotion with heavy loads on rough terrain. Project Page: this https URL.
https://arxiv.org/abs/2507.07825
Mandibular Angle Split Osteotomy (MASO) is a significant procedure in oral and maxillofacial surgery. Despite advances in technique and instrumentation, its success still relies heavily on the surgeon's experience. In this work, a human-robot collaborative system is proposed to perform MASO according to a preoperative plan and under the guidance of a surgeon. A task decomposition methodology is used to divide the collaborative surgical procedure into three subtasks: (1) positional control and (2) orientation control, both led by the robot for precise alignment; and (3) force control, managed by the surgeon to ensure safety. Additionally, to achieve patient tracking without the need for a skull clamp, an optical tracking system (OTS) is utilized. Movement of the patient's mandible is measured with an optical tracker mounted on a dental occlusal splint. A registration method and a robot-OTS calibration method are introduced to achieve reliable navigation within our framework. Drilling experiments conducted on a realistic phantom model demonstrate that the average error between the planned and actual drilling points is 1.85 mm.
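The registration step can be illustrated with the standard rigid point-set (Kabsch/SVD) solution between matched point sets, a generic building block for plan-to-patient and robot-to-tracker alignment; the paper's specific registration and calibration procedures are not reproduced here:

```python
# Hedged sketch of rigid point-set registration via the Kabsch/SVD method.
import numpy as np

def rigid_register(src, dst):
    """Find R, t minimizing ||R @ src_i + t - dst_i|| over corresponding points.

    src, dst : (N, 3) arrays of matched points (e.g., planned vs. tracked).
    """
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)           # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

# Residuals of R @ src + t against dst give the registration error, analogous to
# the millimetre-level drilling error reported above.
```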
https://arxiv.org/abs/2507.07794
The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object names) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. Surprise3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found at this https URL.
https://arxiv.org/abs/2507.07781
Robust Visual SLAM (vSLAM) is essential for autonomous systems operating in real-world environments, where challenges such as dynamic objects, low texture, and, critically, varying illumination conditions often degrade performance. Existing feature-based SLAM systems rely on fixed front-end parameters, making them vulnerable to sudden lighting changes and unstable feature tracking. To address these challenges, we propose "IRAF-SLAM", an Illumination-Robust and Adaptive Feature-Culling front-end designed to enhance vSLAM resilience in complex and challenging environments. Our approach introduces: (1) an image enhancement scheme to preprocess and adjust image quality under varying lighting conditions; (2) an adaptive feature extraction mechanism that dynamically adjusts detection sensitivity based on image entropy, pixel intensity, and gradient analysis; and (3) a feature culling strategy that filters out unreliable feature points using density distribution analysis and a lighting impact factor. Comprehensive evaluations on the TUM-VI and European Robotics Challenge (EuRoC) datasets demonstrate that IRAF-SLAM significantly reduces tracking failures and achieves superior trajectory accuracy compared to state-of-the-art vSLAM methods under adverse illumination conditions. These results highlight the effectiveness of adaptive front-end strategies in improving vSLAM robustness without incurring significant computational overhead. The implementation of IRAF-SLAM is publicly available at https://thanhnguyencanh. this http URL.
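The adaptive-extraction idea can be sketched with OpenCV: compute image entropy and mean gradient, then relax or tighten the detector threshold accordingly. The thresholds, the mapping, and the use of ORB/FAST below are assumptions; IRAF-SLAM defines its own enhancement, sensitivity, and culling rules:

```python
# Hedged sketch: entropy- and gradient-driven adaptation of a feature detector.
import cv2
import numpy as np

def image_entropy(gray):
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / (hist.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def adaptive_features(gray, base_threshold=20):
    entropy = image_entropy(gray)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mean_grad = float(np.mean(np.hypot(gx, gy)))
    # Low-entropy / low-gradient frames (dark or texture-poor) get a more
    # permissive FAST threshold so tracking does not starve.
    scale = np.clip(entropy / 7.0, 0.3, 1.0) * np.clip(mean_grad / 30.0, 0.3, 1.0)
    threshold = max(5, int(base_threshold * scale))
    orb = cv2.ORB_create(nfeatures=1500, fastThreshold=threshold)
    return orb.detectAndCompute(gray, None)

# Usage: keypoints, descriptors = adaptive_features(cv2.imread("frame.png", 0))
```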
https://arxiv.org/abs/2507.07752
Despite their recent introduction to human society, Large Language Models (LLMs) have significantly affected the way we tackle mental challenges in our everyday lives. From optimizing our linguistic communication to assisting us in making important decisions, LLMs, such as ChatGPT, are notably reducing our cognitive load by gradually taking on an increasing share of our mental activities. In the context of Learning by Demonstration (LbD), classifying and segmenting complex motions into primitive actions, such as pushing, pulling, and twisting, is considered a key step towards encoding a task. In this work, we investigate the capabilities of LLMs to undertake this task, considering a finite set of predefined primitive actions found in fruit-picking operations. By utilizing LLMs instead of simple supervised learning or analytic methods, we aim to make the method easily applicable and deployable in a real-life scenario. Three different fine-tuning approaches are investigated and compared on datasets captured kinesthetically with a UR10e robot during a fruit-picking scenario.
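A minimal sketch of posing the primitive-labelling step to an LLM, where `llm` is any chat-completion wrapper: a recorded segment is summarized as text and the model is constrained to a fixed label set. The feature summary and label set are assumptions; the paper fine-tunes models on such data rather than relying on prompting alone:

```python
# Hedged sketch: prompting an LLM to label a kinesthetically recorded segment.
import numpy as np

PRIMITIVES = ["reach", "grasp", "pull", "push", "twist", "place"]

def describe_segment(ee_poses):
    """Summarize a (T, 6) end-effector trajectory (xyz + rpy) as plain text."""
    disp = ee_poses[-1, :3] - ee_poses[0, :3]
    rot = ee_poses[-1, 3:] - ee_poses[0, 3:]
    return (f"translation (m): {np.round(disp, 3).tolist()}, "
            f"rotation change (rad): {np.round(rot, 3).tolist()}, "
            f"duration steps: {len(ee_poses)}")

def classify_segment(llm, ee_poses):
    prompt = (
        "You label robot end-effector motion segments from a fruit-picking task.\n"
        f"Allowed labels: {', '.join(PRIMITIVES)}.\n"
        f"Segment summary: {describe_segment(ee_poses)}.\n"
        "Answer with exactly one label."
    )
    return llm(prompt).strip().lower()
```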
https://arxiv.org/abs/2507.07745