Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to purely visual models, our approach overcomes several drastic failure modes when tracking the in-hand object pose. In our experiments, it shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error, compared to FoundationPose.
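A minimal sketch of how such a spring-mass objective could be realized over a point-based object model follows; the function name, force laws, and gains are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spring_mass_forces(obj_pts, tactile_contacts, gripper_pts,
                       k_attract=1.0, k_repulse=1.0, margin=0.005):
    """Hypothetical spring-mass objective for test-time pose refinement.

    obj_pts:          (N, 3) object surface points under the current pose estimate
    tactile_contacts: (M, 3) contact points sensed by the tactile pads
    gripper_pts:      (K, 3) gripper finger geometry from proprioception
    """
    force = np.zeros(3)
    # Attractive springs: pull the object surface onto the sensed contacts.
    for c in tactile_contacts:
        nearest = obj_pts[np.argmin(np.linalg.norm(obj_pts - c, axis=1))]
        force += -k_attract * (nearest - c)   # Hooke's law toward the contact
    # Repulsive springs: push the object out of penetration with the gripper.
    for g in gripper_pts:
        d = np.linalg.norm(obj_pts - g, axis=1)
        if d.min() < margin:                  # penetration detected
            nearest = obj_pts[np.argmin(d)]
            force += k_repulse * (nearest - g) / (d.min() + 1e-8)
    # A test-time optimizer would nudge the pose estimate along this net force.
    return force
```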
https://arxiv.org/abs/2504.13179
Visuomotor policies learned from teleoperated demonstrations face challenges such as lengthy data collection, high costs, and limited data diversity. Existing approaches address these issues by augmenting image observations in RGB space or employing Real-to-Sim-to-Real pipelines based on physical simulators. However, the former is constrained to 2D data augmentation, while the latter suffers from imprecise physical simulation caused by inaccurate geometric reconstruction. This paper introduces RoboSplat, a novel method that generates diverse, visually realistic demonstrations by directly manipulating 3D Gaussians. Specifically, we reconstruct the scene through 3D Gaussian Splatting (3DGS), directly edit the reconstructed scene, and augment data across six types of generalization with five techniques: 3D Gaussian replacement for varying object types, scene appearance, and robot embodiments; equivariant transformations for different object poses; visual attribute editing for various lighting conditions; novel view synthesis for new camera perspectives; and 3D content generation for diverse object types. Comprehensive real-world experiments demonstrate that RoboSplat significantly enhances the generalization of visuomotor policies under diverse disturbances. Notably, while policies trained on hundreds of real-world demonstrations with additional 2D data augmentation achieve an average success rate of 57.2%, RoboSplat attains 87.8% in one-shot settings across six types of generalization in the real world.
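Of the five techniques, the equivariant transformation is the simplest to make concrete: a rigid SE(3) transform applied consistently to every Gaussian's mean and orientation yields the same reconstructed object in a new pose. The sketch below assumes the common mean-plus-quaternion 3DGS parameterization and is illustrative, not RoboSplat's code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def transform_gaussians(means, quats, T_rot, T_trans):
    """Apply a rigid SE(3) transform to a set of 3D Gaussians
    (the 'equivariant transformation' used for new object poses).

    means:   (N, 3) Gaussian centers
    quats:   (N, 4) Gaussian orientations as xyzw quaternions
    T_rot:   scipy Rotation, the object-level rotation
    T_trans: (3,) object-level translation
    """
    new_means = T_rot.apply(means) + T_trans             # rotate, then translate centers
    new_quats = (T_rot * R.from_quat(quats)).as_quat()   # compose orientations
    return new_means, new_quats

# e.g. rotate a reconstructed object 90 degrees about z and shift it 10 cm along x:
# transform_gaussians(means, quats, R.from_euler('z', 90, degrees=True),
#                     np.array([0.1, 0.0, 0.0]))
```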
https://arxiv.org/abs/2504.13175
Dexterous manipulation is a fundamental capability for robotic systems, yet progress has been limited by hardware trade-offs between precision, compactness, strength, and affordability. Existing control methods impose compromises on hand designs and applications. However, learning-based approaches present opportunities to rethink these trade-offs, particularly to address challenges with tendon-driven actuation and low-cost materials. This work presents RUKA, a tendon-driven humanoid hand that is compact, affordable, and capable. Made from 3D-printed parts and off-the-shelf components, RUKA has 5 fingers with 15 underactuated degrees of freedom enabling diverse human-like grasps. Its tendon-driven actuation allows powerful grasping in a compact, human-sized form factor. To address control challenges, we learn joint-to-actuator and fingertip-to-actuator models from motion-capture data collected by the MANUS glove, leveraging the hand's morphological accuracy. Extensive evaluations demonstrate RUKA's superior reachability, durability, and strength compared to other robotic hands. Teleoperation tasks further showcase RUKA's dexterous movements. The open-source design and assembly instructions of RUKA, code, and data are available at this https URL.
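The joint-to-actuator and fingertip-to-actuator models are, in essence, supervised regressors over glove-recorded pairs. A hedged sketch under assumed dimensions (5 fingertips × xyz as input; the motor count is a placeholder, not RUKA's actual actuator count):

```python
import torch
import torch.nn as nn

N_MOTORS = 11  # placeholder; set to the hand's actual actuator count

# Hypothetical fingertip-to-actuator regressor trained on MANUS glove data.
net = nn.Sequential(nn.Linear(15, 128), nn.ReLU(), nn.Linear(128, N_MOTORS))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def train_step(fingertips, motor_cmds):
    """fingertips: (B, 15) = 5 tips x xyz; motor_cmds: (B, N_MOTORS) recorded targets."""
    loss = nn.functional.mse_loss(net(fingertips), motor_cmds)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```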
https://arxiv.org/abs/2504.13165
This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction work. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: this https URL
https://arxiv.org/abs/2504.13159
A robot navigating an outdoor environment with no prior knowledge of the space must rely on its local sensing to perceive its surroundings and plan. This can come in the form of a local metric map or a local policy with some fixed horizon. Beyond that, there is a fog of unknown space marked with some fixed cost. A limited planning horizon can often result in myopic decisions leading the robot off course or, worse, into very difficult terrain. Ideally, we would like the robot to have full knowledge that can be orders of magnitude larger than a local cost map. In practice, this is intractable due to sparse sensing information and is often computationally expensive. In this work, we make a key observation that long-range navigation only necessitates identifying good frontier directions for planning instead of full map knowledge. To this end, we propose Long Range Navigator (LRN), which learns an intermediate affordance representation mapping high-dimensional camera images to `affordable' frontiers for planning, and then optimizes for maximum alignment with the desired goal. Notably, LRN is trained entirely on unlabeled ego-centric videos, making it easy to scale and adapt to new platforms. Through extensive off-road experiments on Spot and a Big Vehicle, we find that augmenting existing navigation stacks with LRN reduces human interventions at test time and leads to faster decision making, indicating the relevance of LRN. this https URL
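The core idea, scoring frontier directions rather than building a full map, can be sketched as follows; the utility function here is an assumed form (the actual LRN head and weighting are learned).

```python
import numpy as np

def select_frontier(affordance_scores, directions, goal_dir,
                    w_afford=1.0, w_goal=1.0):
    """Pick a long-range heading by trading off learned affordance
    against alignment with the goal direction.

    affordance_scores: (D,) predicted traversability per candidate heading
    directions:        (D, 2) unit vectors for each candidate heading
    goal_dir:          (2,) unit vector toward the goal
    """
    alignment = directions @ goal_dir      # cosine similarity with the goal
    utility = w_afford * affordance_scores + w_goal * alignment
    return directions[np.argmax(utility)]  # heading handed to the local planner
```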
https://arxiv.org/abs/2504.13149
Many soft robots struggle to produce dynamic motions with fast, large displacements. We develop a parallel 6 degree-of-freedom (DoF) Stewart-Gough mechanism using Handed Shearing Auxetic (HSA) actuators. By using soft actuators, we are able to use one third as many mechatronic components as a rigid Stewart platform, while retaining a working payload of 2 kg and an open-loop bandwidth greater than 16 Hz. We show that the platform is capable of both precise tracing and dynamic disturbance rejection when controlling a ball and sliding puck using a Proportional Integral Derivative (PID) controller. We develop a machine-learning-based kinematics model and demonstrate a functional workspace of roughly 10 cm in each translation direction and 28 degrees in each orientation. This 6-DoF device has many of the characteristics associated with rigid components - power, speed, and total workspace - while capturing the advantages of soft mechanisms.
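For context, the ball- and puck-control experiments rest on a standard discrete PID loop on the platform tilt; the textbook form is sketched below purely to make the control setup concrete (gains and time step are placeholders).

```python
class PID:
    """Textbook discrete PID, as one might use to keep a ball centered."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def update(self, err):
        self.integral += err * self.dt              # accumulate error
        deriv = (err - self.prev_err) / self.dt     # finite-difference derivative
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv
```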
https://arxiv.org/abs/2504.13127
Modeling and control of nonlinear dynamics are critical in robotics, especially in scenarios with unpredictable external influences and complex dynamics. Traditional cascaded modular control pipelines often yield suboptimal performance due to conservative assumptions and tedious parameter tuning. Pure data-driven approaches promise robust performance but suffer from low sample efficiency, sim-to-real gaps, and reliance on extensive datasets. Hybrid methods combining learning-based and traditional model-based control in an end-to-end manner offer a promising alternative. This work presents a self-supervised learning framework combining a learning-based inertial odometry (IO) module and differentiable model predictive control (d-MPC) for Unmanned Aerial Vehicle (UAV) attitude control. The IO module denoises raw IMU measurements and predicts UAV attitudes, which are then optimized by MPC for control actions in a bi-level optimization (BLO) setup, where the inner level (the MPC) optimizes control actions and the outer level minimizes the discrepancy between real-world and predicted performance. The framework is thus end-to-end and can be trained in a self-supervised manner. This approach combines the strength of learning-based perception with interpretable model-based control. Results show the approach is effective even under strong wind, and it simultaneously improves both MPC parameter learning and IMU prediction performance.
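One way to picture the bi-level setup: the inner MPC solve sits inside the training graph, so the outer loss on real-versus-predicted attitude backpropagates into both the IO network and the MPC parameters. A hedged PyTorch-style sketch, where `io_net`, `mpc_layer`, and the plain MSE loss are assumptions rather than the paper's exact formulation:

```python
import torch

def bilevel_step(io_net, mpc_layer, imu_seq, att_true, opt):
    """One self-supervised training step of an IO + d-MPC pipeline.

    io_net:    network mapping raw IMU windows to attitude estimates
    mpc_layer: differentiable MPC returning controls and a predicted rollout
    """
    att_est = io_net(imu_seq)                      # denoised attitude from raw IMU
    u, att_pred = mpc_layer(att_est)               # inner level: controls + rollout
    loss = torch.mean((att_pred - att_true) ** 2)  # outer level: real vs. predicted
    opt.zero_grad()
    loss.backward()                                # gradients flow through the MPC layer
    opt.step()
    return loss.item()
```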
https://arxiv.org/abs/2504.13088
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.
https://arxiv.org/abs/2504.13059
This paper presents a new task-space Non-singular Terminal Super-Twisting Sliding Mode (NT-STSM) controller with adaptive gains for robust trajectory tracking of a 7-DOF robotic manipulator. The proposed approach addresses the challenges of chattering, unknown disturbances, and rotational motion tracking, making it well suited for high-DOF manipulators in dexterous manipulation tasks. A rigorous boundedness proof is provided, offering gain selection guidelines for practical implementation. Simulations and hardware experiments with external disturbances demonstrate the proposed controller's robust, accurate tracking with reduced control effort under unknown disturbances compared to other NT-STSM and conventional controllers. The results demonstrate that the proposed NT-STSM controller mitigates chattering and instability in complex motions, making it a viable solution for dexterous robotic manipulation and various industrial applications.
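For context, the scalar super-twisting law at the heart of any STSM controller looks like the following; this is the textbook algorithm, not the paper's task-space NT-STSM with adaptive gains.

```python
import numpy as np

def super_twisting(s, z, k1, k2, dt):
    """One step of the generic (scalar) super-twisting algorithm.

    s: sliding variable        z: internal integrator state
    k1, k2: positive gains     dt: control time step
    """
    u = -k1 * np.sqrt(abs(s)) * np.sign(s) + z  # continuous control term
    z = z - k2 * np.sign(s) * dt                # integral of the discontinuous term
    return u, z
```

The square-root term gives finite-time convergence to the sliding surface while keeping the applied control continuous, which is why super-twisting designs mitigate the chattering of classical sliding mode control.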
https://arxiv.org/abs/2504.13056
This paper presents the Krysalis Hand, a five-finger robotic end-effector that combines a lightweight design, high payload capacity, and a high number of degrees of freedom (DoF) to enable dexterous manipulation in both industrial and research settings. This design integrates the actuators within the hand while maintaining an anthropomorphic form. Each finger joint features a self-locking mechanism that allows the hand to sustain large external forces without active motor engagement. This approach shifts the payload limitation from the motor strength to the mechanical strength of the hand, allowing the use of smaller, more cost-effective motors. With 18 DoF and weighing only 790 grams, the Krysalis Hand delivers an active squeezing force of 10 N per finger and supports a passive payload capacity exceeding 10 lbs. These characteristics make Krysalis Hand one of the lightest, strongest, and most dexterous robotic end-effectors of its kind. Experimental evaluations validate its ability to perform intricate manipulation tasks and handle heavy payloads, underscoring its potential for industrial applications as well as academic research. All code related to the Krysalis Hand, including control and teleoperation, is available on the project GitHub repository: this https URL
https://arxiv.org/abs/2504.12967
Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. Vision-based tactile sensors (VBTSs) have emerged as a promising solution, offering high spatial resolution and cost-effectiveness by sensing contact through camera-captured deformation patterns of elastic gel pads. However, these sensors' complex physical characteristics and visual signal processing requirements present unique challenges for robotic applications. The lack of efficient and accurate simulation tools for VBTSs has significantly limited the scale and scope of tactile robotics research. Here we present Taccel, a high-performance simulation platform that integrates Incremental Potential Contact (IPC) and Affine Body Dynamics (ABD) to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving an 18-fold acceleration over real time across thousands of parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development. By enabling large-scale simulation and experimentation with tactile sensing, Taccel accelerates the development of more capable robotic systems, potentially transforming how robots interact with and understand their physical environment.
https://arxiv.org/abs/2504.12908
We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.
https://arxiv.org/abs/2504.12833
Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard $\alpha$-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over $\alpha$-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset will be released at this https URL.
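A sketch of how such a sliding-window first-surface extractor might look for a single ray, assuming per-Gaussian depths and blending weights are available; the window size and tie-breaking are assumptions.

```python
import numpy as np

def first_surface_depth(depths, weights, win=5):
    """Locate the most likely first surface along one ray.

    depths:  (N,) per-Gaussian depths along the ray, sorted front to back
    weights: (N,) alpha-blending weights of those Gaussians
    """
    if len(depths) <= win:
        return np.average(depths, weights=weights + 1e-12)
    # The window with the largest accumulated weight marks the surface.
    sums = np.convolve(weights, np.ones(win), mode='valid')
    i = np.argmax(sums)
    # Robust weighted-average depth within that window.
    return np.average(depths[i:i + win], weights=weights[i:i + win] + 1e-12)
```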
https://arxiv.org/abs/2504.12799
Adapting robot trajectories based on human instructions as new situations arise is essential for achieving more intuitive and scalable human-robot interactions. This work proposes a flexible language-based framework to adapt generic robotic trajectories produced by off-the-shelf motion planners such as RRT and A*, or learned from human demonstrations. We utilize pre-trained LLMs to adapt trajectory waypoints by generating code as a policy for dense robot manipulation, enabling more complex and flexible instructions than current methods. This approach allows us to incorporate a broader range of commands, including numerical inputs. Compared to state-of-the-art feature-based sequence-to-sequence models, which require task-specific training, our method requires none and offers greater interpretability and more effective feedback mechanisms. We validate our approach through simulation experiments on a robotic manipulator, an aerial vehicle, and a ground robot in the PyBullet and Gazebo simulation environments, demonstrating that LLMs can successfully adapt trajectories to complex human instructions.
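The "code as policy" step amounts to prompting the LLM with the raw waypoints and the instruction, then executing the returned function. An illustrative prompt builder follows; the paper's actual prompt format is not reproduced here.

```python
def build_adaptation_prompt(waypoints, instruction):
    """Compose a code-generation prompt asking an LLM to edit a trajectory.

    waypoints:   list of (x, y, z) tuples from RRT/A*/demonstration
    instruction: natural-language edit, possibly with numbers,
                 e.g. "stay 20 cm above the table near the cup"
    """
    return (
        "You are given robot trajectory waypoints as a Python list:\n"
        f"waypoints = {waypoints}\n"
        f"Instruction: {instruction}\n"
        "Return a Python function adapt(waypoints) that applies the "
        "instruction and returns the modified waypoint list."
    )
```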
https://arxiv.org/abs/2504.12755
B* is a novel optimization framework that addresses a critical challenge in fixed-base manipulator robotics: optimal base placement. Current methods rely on pre-computed kinematics databases generated through sampling to search for solutions. However, they face an inherent trade-off between solution optimality and computational efficiency when determining sampling resolution. To address these limitations, B* unifies multiple objectives without database dependence. The framework employs a two-layer hierarchical approach. The outer layer systematically manages terminal constraints through progressive tightening, particularly for base mobility, enabling feasible initialization and broad solution exploration. The inner layer addresses non-convexities in each outer-layer subproblem through sequential local linearization, converting the original problem into tractable sequential linear programming (SLP). Testing across multiple robot platforms demonstrates B*'s effectiveness. The framework achieves solution optimality five orders of magnitude better than sampling-based approaches while maintaining perfect success rates and reduced computational overhead. Operating directly in configuration space, B* enables simultaneous path planning with customizable optimization criteria. B* serves as a crucial initialization tool that bridges the gap between theoretical motion planning and practical deployment, where feasible trajectory existence is fundamental.
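The inner layer's sequential linear programming can be sketched generically: linearize the objective, take an LP step inside a box trust region, shrink on rejection. This is the bare SLP template, not B*'s full constrained formulation.

```python
import numpy as np
from scipy.optimize import linprog

def slp(f, grad_f, x0, trust=0.1, iters=20, shrink=0.5, tol=1e-6):
    """Bare-bones sequential linear programming over an unconstrained objective."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        # LP subproblem: minimize g . d subject to |d_i| <= trust (box trust region).
        res = linprog(c=g, bounds=[(-trust, trust)] * len(x), method="highs")
        if res.x is None:
            break
        d = res.x
        if f(x + d) < f(x):   # accept improving step
            x = x + d
        else:                 # reject and shrink the trust region
            trust *= shrink
        if trust < tol:
            break
    return x
```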
https://arxiv.org/abs/2504.12719
The development of artificial intelligence towards real-time interaction with the environment is a key aspect of embodied intelligence and robotics. Inverse dynamics is a fundamental robotics problem, which maps from the joint space to the torque space of a robotic system. Traditional methods for solving it rely on direct physical modeling of robots, which is difficult or even impossible due to nonlinearity and external disturbances. Recently, data-based model-learning algorithms have been adopted to address this issue. However, they often require manual parameter tuning and incur high computational costs. Neuromorphic computing is inherently suited to processing spatiotemporal features in robot motion control at extremely low cost. However, current research is still in its infancy: existing works control only low-degree-of-freedom systems and lack performance quantification and comparison. In this paper, we propose a neuromorphic control framework to control 7-degree-of-freedom robotic manipulators. We use a Spiking Neural Network (SNN) to leverage the spatiotemporal continuity of the motion data to improve control accuracy and eliminate manual parameter tuning. We validated the algorithm on two robotic platforms, reducing torque prediction error by at least 60% and successfully performing a target-position tracking task. This work advances embodied neuromorphic control one step forward, from proof of concept to application in complex real-world tasks.
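As a reference point for the spiking machinery, one time step of a leaky integrate-and-fire (LIF) layer, the basic unit such an SNN composes, can be written as below; the shapes and constants are illustrative, not the paper's network.

```python
import numpy as np

def lif_layer(spikes_in, w, v, tau=0.9, v_th=1.0):
    """One step of a leaky integrate-and-fire layer.

    spikes_in: (I,) binary input spikes     w: (I, O) synaptic weights
    v:         (O,) membrane potentials carried across time steps
    """
    v = tau * v + spikes_in @ w               # leak + integrate synaptic input
    spikes_out = (v >= v_th).astype(float)    # fire where threshold is crossed
    v = v * (1.0 - spikes_out)                # reset neurons that fired
    return spikes_out, v
```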
https://arxiv.org/abs/2504.12702
Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points data and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
https://arxiv.org/abs/2504.12636
Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels from videos and morphological differences between robot and human hands. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the human-robot embodiment gap without relying on wearables, teleoperation, or large-scale data collection typically necessary for imitation learning methods. From the demonstration, we extract two task-specific components: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward function, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. We found that these two components are highly effective for learning the desired task, eliminating the need for task-specific reward shaping and tuning. We demonstrate that Human2Sim2Robot outperforms object-aware open-loop trajectory replay by 55% and imitation learning with data augmentation by 68% across grasping, non-prehensile manipulation, and multi-step tasks. Project Site: this https URL
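The object-centric reward can be sketched directly from the extracted pose trajectory: penalize deviation from the demonstrated object pose at each step. The weighting and exact form below are assumptions.

```python
import numpy as np

def object_pose_reward(pos, quat, pos_ref, quat_ref, w_rot=0.1):
    """Embodiment-agnostic reward: track the demonstrated object pose.

    pos, quat:         current object position (3,) and unit quaternion (4,)
    pos_ref, quat_ref: demonstration pose at the matching time step
    """
    pos_err = np.linalg.norm(pos - pos_ref)
    # Quaternion geodesic distance; |dot| handles the double cover q ~ -q.
    rot_err = 2.0 * np.arccos(np.clip(abs(np.dot(quat, quat_ref)), -1.0, 1.0))
    return -(pos_err + w_rot * rot_err)
```

Because the reward refers only to the object, the same function scores a human hand in the video and a robot hand in simulation, which is what lets RL cross the embodiment gap.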
https://arxiv.org/abs/2504.12609
Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store -- an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field.
https://arxiv.org/abs/2504.12512
We propose a framework enabling mobile manipulators to reliably complete pick-and-place tasks for assembling structures from construction blocks. The picking uses an eye-in-hand visual servoing controller for object tracking with Control Barrier Functions (CBFs) to ensure fiducial markers in the blocks remain visible. An additional robot with an eye-to-hand setup ensures precise placement, critical for structural stability. We integrate human-in-the-loop capabilities for flexibility and fault correction and analyze robustness to camera pose errors, proposing adapted barrier functions to handle them. Lastly, experiments validate the framework on 6-DoF mobile arms.
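The visibility requirement can be phrased as a control barrier function on the angle between the camera's optical axis and the bearing to the marker. A minimal sketch of the barrier value follows; the QP filter and the adapted barriers that handle camera-pose error are the paper's contribution and are not reproduced here.

```python
import numpy as np

def visibility_barrier(bearing, optical_axis, half_fov_rad):
    """Barrier value h >= 0 while the fiducial stays inside the camera cone.

    bearing:      unit vector from camera to marker (camera frame)
    optical_axis: unit vector of the camera's viewing direction
    """
    return float(bearing @ optical_axis) - np.cos(half_fov_rad)

# A CBF-QP then filters the nominal servoing command u_nom so that
#   dh/dt + gamma * h(x) >= 0
# holds along the closed loop, keeping the marker in view.
```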
https://arxiv.org/abs/2504.12506