Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose Grounding Large language models with Imperfect world MOdels (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs like LLaMA-3, with boosts of 2.04×, 1.54×, and 1.82× across three different benchmarks, respectively. The resulting models can compete with or surpass much larger counterparts such as GPT-4.
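For intuition, the data-generation loop described above can be sketched roughly as below. Every interface here (`simulator`, `llm`, `memory` and their methods) is a hypothetical stand-in chosen for illustration, not the authors' implementation.

```python
# Illustrative sketch of an LLM-agent data generator with iterative self-refinement
# and retrieval-augmented reflection; all object interfaces are hypothetical.
def generate_instruction_data(llm, simulator, instruction_seeds, memory, max_refinements=3):
    dataset = []
    for seed in instruction_seeds:                       # diverse QA instruction seeds
        experience = simulator.rollout(seed)             # sample experience from the proxy world model
        for _ in range(max_refinements):                 # iterative self-refinement for temporal consistency
            critique = llm.check_consistency(seed, experience)
            if critique.consistent:
                break
            experience = simulator.rollout(seed, hints=critique.suggestions)
        prior = memory.retrieve(seed, k=5)               # retrieval-augmented reflection on past experiences
        dataset.extend(llm.make_qa_pairs(seed, experience, prior))
        memory.add(seed, experience)
    return dataset
```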
https://arxiv.org/abs/2410.02742
Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component of successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose-estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems, since conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data by relying on fixed sampling intervals or working directly in 3D space rather than the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition that focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. The approach is applicable to both learning-based and handcrafted descriptors, and through experimental validation across multiple datasets and descriptor frameworks, we demonstrate its effectiveness, showing it can jointly minimize redundancy and preserve essential information in real time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
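As a rough illustration of sampling in descriptor space rather than 3D space, the sketch below keeps a frame only when its descriptor is sufficiently far from every descriptor already retained. The greedy rule, Euclidean metric, and threshold are our assumptions for the example, not the paper's actual criterion.

```python
import numpy as np

def select_keyframes(descriptors: np.ndarray, min_dist: float = 0.3) -> list[int]:
    """Greedy redundancy-minimizing keyframe selection in descriptor space.

    descriptors: (N, D) array of place-recognition descriptors, in time order.
    Returns indices of retained keyframes.
    """
    kept: list[int] = [0]                      # always keep the first frame
    for i in range(1, len(descriptors)):
        d = np.linalg.norm(descriptors[kept] - descriptors[i], axis=1)
        if d.min() > min_dist:                 # far from everything kept -> informative, keep it
            kept.append(i)
    return kept

# Toy usage: 500 random 256-D descriptors
frames = np.random.default_rng(0).normal(size=(500, 256))
print(len(select_keyframes(frames, min_dist=22.0)))
```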
https://arxiv.org/abs/2410.02643
Swarm robotics, or very large-scale robotics (VLSR), has many meaningful applications for complicated tasks. However, the complexity of motion control and energy costs stack up quickly as the number of robots increases. To address this problem, our previous studies formulated various methods employing macroscopic and microscopic approaches. These methods enable microscopic robots to adhere to a reference Gaussian mixture model (GMM) distribution observed at the macroscopic scale, so that optimizing at the macroscopic level yields an optimal overall outcome. However, all these methods require systematic and global generation of Gaussian components (GCs) within obstacle-free areas to construct the GMM trajectories. This work utilizes centroidal Voronoi tessellation to generate GCs methodically, demonstrating performance improvement while also ensuring consistency and reliability.
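For intuition, a centroidal Voronoi tessellation can be computed with Lloyd iterations over samples of the obstacle-free workspace, and the resulting centroids could then serve as candidate Gaussian-component means. This is a generic sketch under that assumption, not the paper's exact procedure.

```python
import numpy as np

def lloyd_cvt(free_points: np.ndarray, n_components: int, iters: int = 50, seed: int = 0):
    """Centroidal Voronoi tessellation of sampled free space via Lloyd's algorithm.

    free_points: (N, 2) samples drawn from the obstacle-free region.
    Returns (n_components, 2) generator points (candidate GC means).
    """
    rng = np.random.default_rng(seed)
    centers = free_points[rng.choice(len(free_points), n_components, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest generator (Voronoi cells restricted to free space)
        d = np.linalg.norm(free_points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each generator to the centroid of its cell
        for k in range(n_components):
            cell = free_points[labels == k]
            if len(cell):
                centers[k] = cell.mean(axis=0)
    return centers

# Toy usage: free space = unit square minus a central disk obstacle
pts = np.random.default_rng(1).uniform(0, 1, size=(5000, 2))
pts = pts[np.linalg.norm(pts - 0.5, axis=1) > 0.2]
print(lloyd_cvt(pts, n_components=8).round(2))
```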
https://arxiv.org/abs/2410.02510
Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored. In this work, we study the learning of cross-embodiment dexterous grasping policies using reinforcement learning (RL). Inspired by the capability of human hands to control various dexterous hands through teleoperation, we propose a universal action space based on the human hand's eigengrasps. The policy outputs eigengrasp actions that are then converted into specific joint actions for each robot hand through a retargeting mapping. We simplify the robot hand's proprioception to include only the positions of fingertips and the palm, offering a unified observation space across different robot hands. Our approach demonstrates an 80% success rate in grasping objects from the YCB dataset across four distinct embodiments using a single vision-based policy. Additionally, our policy exhibits zero-shot generalization to two previously unseen embodiments and significant improvement in efficient finetuning. For further details and videos, visit our project page this https URL.
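A minimal way to picture the universal action space: the policy emits a few eigengrasp coefficients, and a per-hand retargeting map turns them into joint commands. The linear synergy matrix, rest pose, and joint limits below are random placeholders of our own choosing, not the learned retargeting used in the paper.

```python
import numpy as np

def retarget_eigengrasp(coeffs: np.ndarray, synergy: np.ndarray,
                        rest_pose: np.ndarray, joint_limits: np.ndarray) -> np.ndarray:
    """Map low-dimensional eigengrasp coefficients to one hand's joint angles.

    coeffs:       (K,) eigengrasp activations output by the shared policy.
    synergy:      (J, K) per-embodiment retargeting matrix (eigengrasp basis).
    rest_pose:    (J,) nominal joint configuration.
    joint_limits: (J, 2) lower/upper joint bounds.
    """
    q = rest_pose + synergy @ coeffs
    return np.clip(q, joint_limits[:, 0], joint_limits[:, 1])

# Toy usage: the same 5-D action drives a 16-DoF and a 22-DoF hand
rng = np.random.default_rng(0)
action = rng.normal(size=5)
for n_joints in (16, 22):
    q = retarget_eigengrasp(action, rng.normal(size=(n_joints, 5)) * 0.1,
                            np.zeros(n_joints), np.tile([-1.5, 1.5], (n_joints, 1)))
    print(n_joints, q.shape)
```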
https://arxiv.org/abs/2410.02479
Bimanual dexterous manipulation is a critical yet underexplored area in robotics. Its high-dimensional action space and inherent task complexity present significant challenges for policy learning, and the limited task diversity in existing benchmarks hinders general-purpose skill development. Existing approaches largely depend on reinforcement learning, often constrained by intricately designed reward functions tailored to a narrow set of tasks. In this work, we present a novel approach for efficiently learning diverse bimanual dexterous skills from abundant human demonstrations. Specifically, we introduce BiDexHD, a framework that unifies task construction from existing bimanual datasets and employs teacher-student policy learning to address all tasks. The teacher learns state-based policies using a general two-stage reward function across tasks with shared behaviors, while the student distills the learned multi-task policies into a vision-based policy. With BiDexHD, scalable learning of numerous bimanual dexterous skills from auto-constructed tasks becomes feasible, offering promising advances toward universal bimanual dexterous manipulation. Our empirical evaluation on the TACO dataset, spanning 141 tasks across six categories, demonstrates a task fulfillment rate of 74.59% on trained tasks and 51.07% on unseen tasks, showcasing the effectiveness and competitive zero-shot generalization capabilities of BiDexHD. For videos and more information, visit our project page this https URL.
https://arxiv.org/abs/2410.02477
Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task learning and limited generalization to unseen objects. To overcome these challenges, we introduce ResDex, a novel approach that integrates residual policy learning with a mixture-of-experts (MoE) framework. ResDex is distinguished by its use of geometry-unaware base policies that are efficiently acquired on individual objects and capable of generalizing across a wide range of unseen objects. Our MoE framework incorporates several base policies to facilitate diverse grasping styles suitable for various objects. By learning residual actions alongside weights that combine these base policies, ResDex enables efficient multi-task RL for universal dexterous grasping. ResDex achieves state-of-the-art performance on the DexGraspNet dataset comprising 3,200 objects with an 88.8% success rate. It exhibits no generalization gap with unseen objects and demonstrates superior training efficiency, mastering all tasks within only 12 hours on a single GPU.
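The composition of base policies can be read as a weighted mixture of their actions plus a learned residual correction. The sketch below shows that composition with placeholder linear "experts"; the callables, weights, and residual are illustrative, not the authors' networks.

```python
import numpy as np

def resdex_action(obs: np.ndarray, base_policies, gating_weights: np.ndarray,
                  residual: np.ndarray) -> np.ndarray:
    """Compose a mixture-of-experts action with a residual correction.

    base_policies:  list of callables obs -> action (geometry-unaware base policies).
    gating_weights: (E,) non-negative weights over experts (e.g., a softmax output).
    residual:       residual action from the multi-task RL policy.
    """
    base_actions = np.stack([pi(obs) for pi in base_policies])   # (E, A)
    return gating_weights @ base_actions + residual

# Toy usage with two linear "experts" acting on a 10-D observation
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(6, 10)), rng.normal(size=(6, 10))
experts = [lambda o, W=W1: W @ o, lambda o, W=W2: W @ o]
obs = rng.normal(size=10)
weights = np.array([0.7, 0.3])
print(resdex_action(obs, experts, weights, residual=0.05 * rng.normal(size=6)))
```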
https://arxiv.org/abs/2410.02475
Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding-horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified by linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options via hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at: this https URL.
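The determinantal guidance can be illustrated with a greedy selection under a similarity kernel: each new candidate is scored by the determinant of the kernel submatrix it would induce, which favours diverse batches. This is a generic DPP-flavoured sketch of that idea, not the paper's exact posterior-sampling procedure.

```python
import numpy as np

def greedy_determinantal_select(candidates: np.ndarray, k: int, bandwidth: float = 1.0):
    """Greedily pick k diverse candidates by maximizing the determinant of an RBF kernel submatrix."""
    d2 = ((candidates[:, None, :] - candidates[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))          # similarity kernel over candidate options
    chosen = [int(K.sum(1).argmin())]               # start with the most "isolated" candidate
    while len(chosen) < k:
        best, best_det = None, -np.inf
        for i in range(len(candidates)):
            if i in chosen:
                continue
            sub = K[np.ix_(chosen + [i], chosen + [i])]
            det = np.linalg.det(sub)                # larger determinant = more diverse set
            if det > best_det:
                best, best_det = i, det
        chosen.append(best)
    return chosen

# Toy usage: pick 4 diverse option embeddings out of 50
options = np.random.default_rng(0).normal(size=(50, 8))
print(greedy_determinantal_select(options, k=4))
```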
https://arxiv.org/abs/2410.02389
The coastal underwater evidence search system with surface-underwater collaboration is designed to revolutionize the search for artificial objects in coastal underwater environments, overcoming limitations of traditional methods such as divers and tethered remotely operated vehicles. Our innovative multi-robot collaborative system consists of three parts: an autonomous surface vehicle serving as mission control center, a towed underwater vehicle for wide-area search, and a biomimetic underwater robot inspired by marine organisms for detailed inspection of identified areas. We conduct extensive simulations and real-world experiments in pond environments and coastal field sites to demonstrate the system's potential to surpass the limitations of conventional underwater search methods, offering a robust and efficient solution for law enforcement and recovery operations in marine settings.
https://arxiv.org/abs/2410.02345
Urban environments face significant challenges due to climate change, including extreme heat, drought, and water scarcity, which impact public health, community well-being, and local economies. Effective management of these issues is crucial, particularly in areas like Sydney Olympic Park, which relies on one of Australia's largest irrigation systems. The Smart Irrigation Management for Parks and Cool Towns (SIMPaCT) project, initiated in 2021, leverages advanced technologies and machine learning models to optimize irrigation and induce physical cooling. This paper introduces two novel methods to enhance the efficiency of the SIMPaCT system's extensive sensor network and applied machine learning models. The first method employs clustering of sensor time series data using K-shape and K-means algorithms to estimate readings from missing sensors, ensuring continuous and reliable data. This approach can detect anomalies, correct data sources, and identify and remove redundant sensors to reduce maintenance costs. The second method involves sequential data collection from different sensor locations using robotic systems, significantly reducing the need for high numbers of stationary sensors. Together, these methods aim to maintain accurate soil moisture predictions while optimizing sensor deployment and reducing maintenance costs, thereby enhancing the efficiency and effectiveness of the smart irrigation system. Our evaluations demonstrate significant improvements in the efficiency and cost-effectiveness of soil moisture monitoring networks. The cluster-based replacement of missing sensors provides up to 5.4% decrease in average error. The sequential sensor data collection as a robotic emulation shows 17.2% and 2.1% decrease in average error for circular and linear paths respectively.
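The first method can be pictured as clustering sensor time series and imputing a missing sensor from its cluster peers. The sketch below uses scikit-learn's KMeans on standardized series (K-Shape would require a dedicated library such as tslearn), and the simple cluster-mean imputation rule is our assumption rather than the exact SIMPaCT estimator.

```python
import numpy as np
from sklearn.cluster import KMeans

def impute_from_cluster(history: np.ndarray, recent: np.ndarray, failed: int, n_clusters: int = 4):
    """Estimate a failed sensor's recent readings from sensors clustered with it historically.

    history: (n_sensors, T_hist) past readings used to cluster the network.
    recent:  (n_sensors, T_new) new readings, where row `failed` is unavailable.
    """
    # Standardize each historical series so clustering reflects temporal shape, not scale
    z = (history - history.mean(1, keepdims=True)) / (history.std(1, keepdims=True) + 1e-9)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(z)
    peers = [i for i in np.where(labels == labels[failed])[0] if i != failed]
    return recent[peers].mean(axis=0)        # cluster-mean imputation (our simplification)

# Toy usage: 30 soil-moisture sensors, one week of history, one day of new readings
rng = np.random.default_rng(0)
hist = rng.normal(20, 3, size=(30, 672))
new = rng.normal(20, 3, size=(30, 96))
print(impute_from_cluster(hist, new, failed=5).shape)
```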
https://arxiv.org/abs/2410.02335
Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain only partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated with simple grasp-sampling methods based on priors. Recently, Quality-Diversity (QD) algorithms have been shown to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. The conducted experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20%. We used this approach to generate QDGset, a dataset of 6DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state of the art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.
https://arxiv.org/abs/2410.02319
Detecting 3D keypoints with semantic consistency is widely used in many scenarios such as pose estimation, shape registration, and robotics. Currently, most unsupervised 3D keypoint detection methods focus on rigid-body objects. However, when faced with deformable objects, the keypoints they identify do not preserve semantic consistency well. In this paper, we introduce Key-Grid, an innovative unsupervised keypoint detector for both rigid-body and deformable objects, built as an autoencoder framework. The encoder predicts keypoints and the decoder utilizes the generated keypoints to reconstruct the objects. Unlike previous work, we leverage the identified keypoint information to form a 3D grid feature heatmap, called the grid heatmap, which is used in the decoder. The grid heatmap is a novel concept that represents latent variables for grid points sampled uniformly in the 3D cubic space, where these variables are the shortest distances between the grid points and the skeleton connected by keypoint pairs. Meanwhile, we incorporate information from each layer of the encoder into the decoder. We conduct an extensive evaluation of Key-Grid on a list of benchmark datasets. Key-Grid achieves state-of-the-art performance on the semantic consistency and position accuracy of keypoints. Moreover, we demonstrate the robustness of Key-Grid to noise and downsampling. In addition, we achieve SE(3) invariance of keypoints by generalizing Key-Grid to an SE(3)-invariant backbone.
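The grid heatmap amounts to, for every grid point sampled in the cube, the shortest distance to the skeleton formed by the keypoint pairs. A direct NumPy version of that computation is below; the conversion of distances into a bounded heatmap via a Gaussian kernel is our own choice for illustration.

```python
import numpy as np

def grid_heatmap(keypoints: np.ndarray, pairs: list[tuple[int, int]], res: int = 16, sigma: float = 0.1):
    """Distance-based grid heatmap for a skeleton connecting keypoint pairs.

    keypoints: (K, 3) predicted keypoints in a unit cube [-0.5, 0.5]^3.
    pairs:     list of (i, j) keypoint indices forming skeleton segments.
    Returns:   (res, res, res) heatmap, high near the skeleton.
    """
    axis = np.linspace(-0.5, 0.5, res)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)  # (G, 3)
    dmin = np.full(len(grid), np.inf)
    for i, j in pairs:
        a, b = keypoints[i], keypoints[j]
        ab = b - a
        t = np.clip((grid - a) @ ab / (ab @ ab + 1e-12), 0.0, 1.0)   # closest point on segment
        dist = np.linalg.norm(grid - (a + t[:, None] * ab), axis=1)  # point-to-segment distance
        dmin = np.minimum(dmin, dist)
    return np.exp(-(dmin ** 2) / (2 * sigma ** 2)).reshape(res, res, res)

# Toy usage: 8 random keypoints, 3 skeleton segments
kps = np.random.default_rng(0).uniform(-0.4, 0.4, size=(8, 3))
print(grid_heatmap(kps, [(0, 1), (1, 2), (2, 3)]).shape)
```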
https://arxiv.org/abs/2410.02237
Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, the metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to strains from as low as 0.005% to as high as 155%, and show stability during extensive use and washing cycles. We use multi-stage machine learning to report average joint-angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subject cross-validation, respectively, matching the accuracy of costly motion-capture cameras without occlusion or field-of-view limitations. We report a data augmentation technique that enhances robustness to noise and sensor variations. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language, and object identification.
https://arxiv.org/abs/2410.02221
Vision-Language Models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps such as opening drawers to access objects are often omitted in their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but do not scale to everyday problems that require common-sense knowledge and involve large state spaces comprised of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%). See the project site this https URL for more information.
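The overall loop can be sketched as: ask the VLM for intermediate subgoals, try to refine each subgoal with a task-and-motion planner, and requery the VLM when refinement fails. The interfaces below (`vlm.propose_subgoals`, `tamp.refine`) are hypothetical stand-ins for illustration, not the paper's API.

```python
def vlm_tamp_plan(vlm, tamp, state, goal, max_replans: int = 5):
    """Hierarchical planning loop: VLM subgoals guide a task-and-motion planner (illustrative only)."""
    trajectory, feedback = [], None
    for _ in range(max_replans):
        # 1) Query the VLM for semantically meaningful, horizon-reducing subgoals
        subgoals = vlm.propose_subgoals(state, goal, feedback=feedback)
        failed = None
        for sg in subgoals:
            # 2) Refine each subgoal into a geometrically/kinematically feasible motion plan
            plan = tamp.refine(state, sg)
            if plan is None:                       # infeasible subgoal -> trigger replanning
                failed = sg
                break
            trajectory += plan.actions
            state = plan.end_state
        if failed is None:
            return trajectory                      # all subgoals refined; goal reached
        feedback = f"subgoal failed: {failed}"     # 3) feed the failure back to the VLM
    return None
```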
https://arxiv.org/abs/2410.02193
Recent advancements in humanoid robotics, including the integration of hierarchical reinforcement learning-based control and the utilization of LLM planning, have significantly enhanced the ability of robots to perform complex tasks. In contrast to the highly developed humanoid robots, the human factors involved remain relatively unexplored. Directly controlling humanoid robots with the brain has long appeared in science fiction, such as Pacific Rim and Gundam. In this work, we present E2H (EEG-to-Humanoid), an innovative framework that pioneers the control of humanoid robots using high-frequency non-invasive neural signals. Because non-invasive signals remain too low in quality to decode precise spatial trajectories, we decompose the E2H framework into an innovative two-stage formulation: 1) decoding neural signals (EEG) into semantic motion keywords, and 2) utilizing LLM-facilitated motion generation with a precise motion-imitation control policy to realize humanoid robot control. Driving robots directly with brainwave commands offers a novel approach to human-machine collaboration, especially in situations where verbal commands are impractical, such as speech impairments, space exploration, or underwater exploration, unlocking significant potential. E2H offers an exciting glimpse into the future, holding immense potential for human-computer interaction.
https://arxiv.org/abs/2410.02141
This paper presents an approach for navigation and control in unmapped environments under input and state constraints using a composite control barrier function (CBF). We consider the scenario where real-time perception feedback (e.g., LiDAR) is used online to construct a local CBF that models local state constraints (e.g., local safety constraints such as obstacles) in the a priori unmapped environment. The approach employs a soft-maximum function to synthesize a single time-varying CBF from the N most recently obtained local CBFs. Next, the input constraints are transformed into controller-state constraints through the use of control dynamics. Then, we use a soft-minimum function to compose the input constraints with the time-varying CBF that models the a priori unmapped environment. This composition yields a single relaxed CBF, which is used in a constrained optimization to obtain an optimal control that satisfies the state and input constraints. The approach is validated through simulations of a nonholonomic ground robot that is equipped with LiDAR and navigates an unmapped environment. The robot successfully navigates the environment while avoiding the a priori unmapped obstacles and satisfying both speed and input constraints.
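The soft-maximum and soft-minimum compositions are standard log-sum-exp constructions; the sketch below shows one common form of each and how local CBF values could be composed with input-constraint values. The specific log-sum-exp variant, smoothing parameter, and toy values are our assumptions, not necessarily the paper's exact definitions.

```python
import numpy as np

def soft_max(h: np.ndarray, rho: float = 10.0) -> float:
    """Smooth over-approximation of max(h): union of the safe sets of N local CBFs."""
    return float(np.log(np.sum(np.exp(rho * h))) / rho)

def soft_min(h: np.ndarray, rho: float = 10.0) -> float:
    """Smooth under-approximation of min(h): intersection of constraints (state + input)."""
    return float(-np.log(np.sum(np.exp(-rho * h))) / rho)

# Compose N local perception-derived CBFs into one time-varying CBF,
# then combine it with input-constraint CBFs into a single relaxed CBF.
local_h = np.array([0.8, 0.3, 1.2])          # values h_i(x) of the N most recent local CBFs
input_h = np.array([0.5, 0.9])               # controller-state constraints from input limits
h_env = soft_max(local_h)                    # soft-maximum over the local CBFs
h_total = soft_min(np.array([h_env, *input_h]))
print(h_env, h_total)
```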
https://arxiv.org/abs/2410.02106
Monocular Depth and Surface Normals Estimation (MDSNE) is crucial for tasks such as 3D reconstruction, autonomous navigation, and underwater exploration. Current methods rely either on discriminative models, which struggle with transparent or reflective surfaces, or generative models, which, while accurate, are computationally expensive. This paper presents a novel deep learning model for MDSNE, specifically tailored for underwater environments, using a hybrid architecture that integrates Convolutional Neural Networks (CNNs) with Transformers, leveraging the strengths of both approaches. Training effective MDSNE models is often hampered by noisy real-world datasets and the limited generalization of synthetic datasets. To address this, we generate pseudo-labeled real data using multiple pre-trained MDSNE models. To ensure the quality of this data, we propose the Depth Normal Evaluation and Selection Algorithm (DNESA), which evaluates and selects the most reliable pseudo-labeled samples using domain-specific metrics. A lightweight student model is then trained on this curated dataset. Our model reduces parameters by 90% and training costs by 80%, allowing real-time 3D perception on resource-constrained devices. Key contributions include: a novel and efficient MDSNE model, the DNESA algorithm, a domain-specific data pipeline, and a focus on real-time performance and scalability. Designed for real-world underwater applications, our model facilitates low-cost deployments in underwater robots and autonomous vehicles, bridging the gap between research and practical implementation.
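A rough way to picture pseudo-label selection: score each pseudo-labeled sample by agreement among the teacher models (here, depth variance and normal-direction consistency) and keep only the most consistent ones. These metrics are illustrative stand-ins, not the actual DNESA criteria.

```python
import numpy as np

def select_pseudo_labels(depth_preds: np.ndarray, normal_preds: np.ndarray, keep_ratio: float = 0.5):
    """Keep the samples on which multiple teacher models agree most (illustrative selection).

    depth_preds:  (M, N, H, W)    depth maps from M teachers for N samples.
    normal_preds: (M, N, H, W, 3) unit normal maps from M teachers for N samples.
    """
    # Inter-teacher disagreement: depth variance and mean angular agreement of normals
    depth_var = depth_preds.var(axis=0).mean(axis=(1, 2))                      # (N,)
    mean_n = normal_preds.mean(axis=0)
    mean_n /= np.linalg.norm(mean_n, axis=-1, keepdims=True) + 1e-9
    cos = (normal_preds * mean_n).sum(-1).mean(axis=(0, 2, 3))                 # (N,) mean cosine agreement
    score = cos - depth_var / (depth_var.max() + 1e-9)                         # higher = more reliable
    k = max(1, int(keep_ratio * len(score)))
    return np.argsort(score)[-k:]                                              # indices of retained samples

# Toy usage: 3 teachers, 20 samples, 32x32 predictions
rng = np.random.default_rng(0)
d = rng.uniform(0.5, 5.0, size=(3, 20, 32, 32))
n = rng.normal(size=(3, 20, 32, 32, 3)); n /= np.linalg.norm(n, axis=-1, keepdims=True)
print(select_pseudo_labels(d, n))
```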
https://arxiv.org/abs/2410.02072
In this paper, we tackle the problem of estimating 3D contact forces using vision-based tactile sensors. In particular, our goal is to estimate contact forces over a large range (up to 15 N) on any object while generalizing across different vision-based tactile sensors. Thus, we collected a dataset of over 200K indentations using a robotic arm that pressed various indenters onto a GelSight Mini sensor mounted on a force sensor, and then used the data to train a multi-head transformer for force regression. Strong generalization is achieved via accurate data collection and multi-objective optimization that leverages depth contact images. Despite being trained only on primitive shapes and textures, the regressor achieves a mean absolute error of 4% on a dataset of unseen real-world objects. We further evaluate our approach's generalization capability to other GelSight Mini and DIGIT sensors, and propose a reproducible calibration procedure for adapting the pre-trained model to other vision-based sensors. Furthermore, the method was evaluated on real-world tasks, including weighing objects and controlling the deformation of delicate objects, which rely on accurate force feedback. Project webpage: this http URL
https://arxiv.org/abs/2410.02048
Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model's sensitivity using automated image-editing tools. Our approach is compatible with any off-the-shelf VLA without model fine-tuning or access to the model's weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 40%. Website with additional information, videos, and code: this https URL.
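One way to picture the run-time intervention: perturb candidate task-irrelevant regions, measure how much the policy's action changes, and edit only the regions the model is sensitive to. The sketch below uses a flat-fill "edit", box regions, and a generic callable policy, all stand-ins for the automated image-editing tools and the actual VLA.

```python
import numpy as np

def intervene(image: np.ndarray, regions: list[tuple[slice, slice]], policy, tau: float = 0.05):
    """Edit task-irrelevant regions that the policy is sensitive to (illustrative sketch).

    image:   (H, W, 3) float image.
    regions: candidate task-irrelevant regions as (row_slice, col_slice) boxes.
    policy:  callable image -> action vector (stand-in for the VLA).
    """
    base_action = policy(image)
    edited = image.copy()
    for rs, cs in regions:
        probe = image.copy()
        probe[rs, cs] = probe[rs, cs].mean(axis=(0, 1))        # cheap perturbation of the region
        sensitivity = np.linalg.norm(policy(probe) - base_action)
        if sensitivity > tau:                                  # model is distracted by this region
            edited[rs, cs] = edited[rs, cs].mean(axis=(0, 1))  # minimal edit (flat fill here)
    return edited

# Toy usage with a stand-in "policy" that just averages pixel intensities
img = np.random.default_rng(0).uniform(size=(64, 64, 3))
boxes = [(slice(0, 16), slice(0, 16)), (slice(40, 64), slice(40, 64))]
out = intervene(img, boxes, policy=lambda im: np.array([im.mean()]))
print(out.shape)
```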
https://arxiv.org/abs/2410.01971
Imitation learning from human motion capture (MoCap) data provides a promising way to train humanoid robots. However, due to differences in morphology, such as varying degrees of joint freedom and force limits, exact replication of human behaviors may not be feasible for humanoid robots. Consequently, incorporating physically infeasible MoCap data in training datasets can adversely affect the performance of the robot policy. To address this issue, we propose a bi-level optimization-based imitation learning framework that alternates between optimizing both the robot policy and the target MoCap data. Specifically, we first develop a generative latent dynamics model using a novel self-consistent auto-encoder, which learns sparse and structured motion representations while capturing desired motion patterns in the dataset. The dynamics model is then utilized to generate reference motions while the latent representation regularizes the bi-level motion imitation process. Simulations conducted with a realistic model of a humanoid robot demonstrate that our method enhances the robot policy by modifying reference motions to be physically consistent.
https://arxiv.org/abs/2410.01968
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines for performing necessary tasks in the real world construction zones. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets, NTU-RGB+D, NTU-RGB+D120 and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The codes and dataset are available at: this https URL
https://arxiv.org/abs/2410.01962