This work investigates the potential of Reinforcement Learning (RL) to tackle robot motion planning challenges in the dynamic RoboCup Small Size League (SSL). Using a heuristic control approach, we evaluate RL's effectiveness in obstacle-free and single-obstacle path-planning environments. Ablation studies reveal significant performance improvements: our method achieved a 60% time gain in obstacle-free environments compared to baseline algorithms. Additionally, our method demonstrated dynamic obstacle avoidance, adeptly navigating around moving obstacles. These findings highlight the potential of RL to enhance robot motion planning in the challenging and unpredictable SSL environment.
https://arxiv.org/abs/2404.15410
The majority of multi-agent path finding (MAPF) methods compute collision-free space-time paths which require agents to be at a specific location at a specific discretized timestep. However, executing these space-time paths directly on robotic systems is infeasible due to real-time execution differences (e.g. delays) which can lead to collisions. To combat this, current methods translate the space-time paths into a temporal plan graph (TPG) that only requires that agents observe the order in which they navigate through locations where their paths cross. However, planning space-time paths and then post-processing them into a TPG does not reduce the required agent-to-agent coordination, which is fixed once the space-time paths are computed. To that end, we propose a novel algorithm Space-Order CBS that can directly plan a TPG and explicitly minimize coordination. Our main theoretical insight is our novel perspective on viewing a TPG as a set of space-visitation order paths where agents visit locations in relative orders (e.g. 1st vs 2nd) as opposed to specific timesteps. We redefine unique conflicts and constraints for adapting CBS for space-order planning. We experimentally validate how Space-Order CBS can return TPGs which significantly reduce coordination, thus subsequently reducing the amount of agent-agent communication and leading to more robustness to delays during execution.
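The "space-visitation order" view above can be made concrete with a small sketch: given space-time paths, the only information a TPG retains is the relative order in which agents pass each shared location. The `paths` data and helper below are hypothetical toys for illustration, not the Space-Order CBS algorithm itself.

```python
from collections import defaultdict

def build_tpg_orders(paths):
    """From space-time paths, extract the per-location visitation order
    (who passes a shared cell 1st, 2nd, ...) -- the only thing a TPG asks
    agents to respect during execution.

    paths: dict agent -> list of (timestep, location)
    returns: dict location -> list of agents in visiting order
    """
    visits = defaultdict(list)
    for agent, path in paths.items():
        for t, loc in path:
            visits[loc].append((t, agent))
    # Sorting by timestep recovers the relative order; the timestamps
    # themselves are then discarded.
    return {loc: [a for _, a in sorted(evs)] for loc, evs in visits.items()}

paths = {
    "A": [(0, "c1"), (1, "c2"), (2, "c3")],
    "B": [(0, "c4"), (1, "c3"), (2, "c2")],
}
order = build_tpg_orders(paths)
print(order["c2"])  # → ['A', 'B']  (A passes c2 before B)
```

Any execution that respects these per-location orders is consistent with the TPG, regardless of actual arrival times.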
https://arxiv.org/abs/2404.15137
Tactile and textile skin technologies have become increasingly important for enhancing human-robot interaction and allowing robots to adapt to different environments. Despite notable advancements, there are ongoing challenges in skin signal processing, particularly in achieving both accuracy and speed in dynamic touch sensing. This paper introduces a new framework that poses touch sensing as an estimation problem over resistive sensory arrays. Utilizing a Regularized Least Squares objective function that estimates the resistance distribution of the skin, we enhance touch sensing accuracy and mitigate ghosting effects, where false or misleading touches may be registered. Furthermore, our study presents a streamlined skin design that simplifies manufacturing processes without sacrificing performance. Experimental outcomes substantiate the effectiveness of our method, showing a 26.9% improvement in multi-touch force-sensing accuracy for the tactile skin.
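The Regularized Least Squares step can be sketched in miniature: treat the skin readout as a linear measurement of an unknown resistance/conductance distribution and solve a ridge-regularized inverse problem. The sensor matrix `A`, the array size, and the regularization weight below are illustrative assumptions, not the paper's actual sensor model.

```python
import numpy as np

def regularized_least_squares(A, b, lam=1e-3):
    """Ridge-regularized estimate x of the conductance distribution from
    readout measurements b = A @ x + noise.

    Solves min_x ||A x - b||^2 + lam * ||x||^2 via the normal equations.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Toy 2x2 resistive array: 4 unknown cell conductances, 6 synthetic
# measurements with a random (hypothetical) readout matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x_true = np.array([0.5, 0.0, 0.0, 1.2])   # two cells pressed
b = A @ x_true + 0.01 * rng.standard_normal(6)
x_hat = regularized_least_squares(A, b)
```

The `lam` term trades reconstruction fidelity against noise amplification, which is one way ghosting-like artifacts can be suppressed in an ill-conditioned readout.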
https://arxiv.org/abs/2404.15131
Replicating the remarkable athleticism seen in animals has long been a challenge in robotics control. Although Reinforcement Learning (RL) has demonstrated significant progress in dynamic legged locomotion control, the substantial sim-to-real gap often hinders the real-world demonstration of truly dynamic movements. We propose a new framework to mitigate this gap through frequency-domain analysis-based impedance matching between simulated and real robots. Our framework offers a structured guideline for parameter selection and for the range of dynamics randomization in simulation, thus facilitating a safe sim-to-real transfer. The policy learned using our framework enabled jumps across distances of 55 cm and heights of 38 cm. The results are, to the best of our knowledge, among the highest and longest jumps demonstrated by an RL-based control policy on a real quadruped robot. Note that the achieved jumping height is approximately 85% of that obtained from a state-of-the-art trajectory optimization method, which can be regarded as the physical limit of the given robot hardware. In addition, our control policy accomplished stable walking at speeds up to 2 m/s in the forward and backward directions, and 1 m/s in the sideways direction.
https://arxiv.org/abs/2404.15096
Teleoperation is a popular solution for remotely supporting highly automated vehicles through a human remote operator whenever a disengagement of the automated driving system occurs. The remote operator wirelessly connects to the vehicle and resolves the disengagement through support or substitution of automated driving functions, thereby enabling the vehicle to resume automation. There are different approaches to supporting automated driving functions at various levels, commonly known as teleoperation concepts. A variety of teleoperation concepts is described in the literature, yet there has been no comprehensive and structured comparison of these concepts, and it is not clear which subset of teleoperation concepts is suitable for enabling safe and efficient remote support of highly automated vehicles across a broad spectrum of disengagements. The following work establishes a basis for comparing teleoperation concepts through a literature overview of automated vehicle disengagements, of previously conducted studies comparing teleoperation concepts, and of metrics used to evaluate teleoperation performance. An evaluation of the teleoperation concepts is carried out in an expert workshop, comparing different teleoperation concepts using a selection of automated vehicle disengagement scenarios and metrics. Based on the workshop results, a set of teleoperation concepts is derived that can be used to address a wide variety of automated vehicle disengagements in a safe and efficient way.
https://arxiv.org/abs/2404.15030
We propose a novel pipeline for unknown object grasping in shared robotic autonomy scenarios. State-of-the-art methods for fully autonomous scenarios are typically learning-based approaches, optimised for a specific end-effector, that generate grasp poses directly from sensor input. In the domain of assistive robotics, we seek instead to utilise the user's cognitive abilities for enhanced satisfaction, grasping performance, and alignment with their high-level task-specific goals. Given a pair of stereo images, we perform unknown object instance segmentation and generate a 3D reconstruction of the object of interest. In shared control, the user then guides the robot end-effector across a virtual hemisphere centered around the object to their desired approach direction. A physics-based grasp planner finds the most stable local grasp on the reconstruction, and finally the user is guided by shared control to this grasp. In experiments on the DLR EDAN platform, we report a grasp success rate of 87% for 10 unknown objects, and demonstrate the method's capability to grasp objects in structured clutter and from shelves.
https://arxiv.org/abs/2404.15001
The emergence of Large Vision Models (LVMs) follows in the footsteps of the recent prosperity of Large Language Models (LLMs). However, there is a noticeable gap in structured research applying LVMs to Human-Robot Interaction (HRI), despite extensive evidence supporting the efficacy of vision models in enhancing interactions between humans and robots. Recognizing this vast and anticipated potential, we introduce an initial design space that incorporates domain-specific LVMs, chosen for their superior performance over general-purpose models. We delve into three primary dimensions: HRI contexts, vision-based tasks, and specific domains. Empirical validation was carried out with 15 experts across six evaluation metrics, showcasing the primary efficacy in relevant decision-making scenarios. We explore the process of ideation and potential application scenarios, envisioning this design space as a foundational guideline for future HRI system design, emphasizing accurate domain alignment and model selection.
https://arxiv.org/abs/2404.14965
A common prerequisite for evaluating a visual(-inertial) odometry (VO/VIO) algorithm is to align the timestamps and the reference frame of its estimated trajectory with a reference ground-truth derived from a system of superior precision, such as a motion capture system. The trajectory-based alignment, typically modeled as a classic hand-eye calibration, significantly influences the accuracy of evaluation metrics. However, traditional calibration methods are susceptible to the quality of the input poses. Few studies have taken this into account when evaluating VO/VIO trajectories that usually suffer from noise and drift. To fill this gap, we propose a novel spatiotemporal hand-eye calibration algorithm that fully leverages multiple constraints from screw theory for enhanced accuracy and robustness. Experimental results show that our algorithm has better performance and is less noise-prone than state-of-the-art methods.
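As a baseline for the trajectory-alignment step described above, the classic closed-form rigid alignment (the Kabsch/Umeyama solution) can be sketched as follows. This is the standard alignment the paper builds on, shown here in 2-D on synthetic data; it is not the proposed screw-theory calibration algorithm.

```python
import numpy as np

def align_trajectories(est, ref):
    """Rigid (rotation + translation) alignment of an estimated trajectory
    onto a ground-truth one, via the closed-form Kabsch/Umeyama solution.

    est, ref: (N, d) arrays of corresponding positions.
    Returns R (d, d) and t (d,) minimizing sum ||R p_est + t - p_ref||^2.
    """
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    mu_e, mu_r = est.mean(0), ref.mean(0)
    H = (est - mu_e).T @ (ref - mu_r)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0] * (H.shape[0] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = mu_r - R @ mu_e
    return R, t

# Toy check: rotate and translate a square, then recover the transform.
ref = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
est = (ref - np.array([0.5, 0.2])) @ R_true.T   # a displaced copy
R, t = align_trajectories(est, ref)
aligned = est @ R.T + t                         # should match ref
```

Because this closed form weights all pose pairs equally, its result degrades under the noise and drift typical of VO/VIO estimates, which is the sensitivity the abstract targets.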
https://arxiv.org/abs/2404.14894
Dynamic obstacle avoidance is a popular research topic for autonomous systems, such as micro aerial vehicles and service robots. Accurately evaluating the performance of dynamic obstacle avoidance methods necessitates a metric to quantify the environment's difficulty, a crucial aspect that remains unexplored. In this paper, we propose four metrics to measure the difficulty of dynamic environments. These metrics aim to comprehensively capture the influence of obstacles' number, size, velocity, and other factors on the difficulty. We compare the proposed metrics with existing static environment difficulty metrics and validate them through over 1.5 million trials in a customized simulator. This simulator excludes the effects of perception and control errors and supports different motion and gaze planners for obstacle avoidance. The results indicate that the survivability metric outperforms the others and exhibits a monotonic relationship with the success rate, with a Spearman's Rank Correlation Coefficient (SRCC) of over 0.9. Specifically, for every planner, lower survivability leads to a higher success rate. This metric not only facilitates fair and comprehensive benchmarking but also provides insights for refining collision avoidance methods, thereby furthering the evolution of autonomous systems in dynamic environments.
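For reference, the Spearman's Rank Correlation Coefficient used above can be computed from scratch: rank both samples (averaging ranks over ties), then take the Pearson correlation of the ranks. The difficulty/success numbers below are hypothetical illustration data.

```python
def spearman_rcc(xs, ys):
    """Spearman's Rank Correlation Coefficient between two samples."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # average rank for a tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical difficulty scores vs. observed success rates: a perfectly
# monotone decreasing relationship gives an SRCC close to -1.
difficulty = [0.1, 0.3, 0.5, 0.7, 0.9]
success = [0.95, 0.80, 0.60, 0.40, 0.10]
print(spearman_rcc(difficulty, success))  # close to -1.0
```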
https://arxiv.org/abs/2404.14848
This research addresses the challenge of estimating bathymetry from imaging sonars where the state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme and model the sonar measurements with a differentiable renderer using sonar volumetric rendering employed with hierarchical sampling techniques. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
https://arxiv.org/abs/2404.14819
Teaching robots novel skills with demonstrations via human-in-the-loop data collection techniques like kinesthetic teaching or teleoperation puts a heavy burden on human supervisors. In contrast to this paradigm, it is often significantly easier to provide raw, action-free visual data of tasks being performed. Moreover, this data can even be mined from video datasets or the web. Ideally, this data can serve to guide robot learning for new tasks in novel environments, informing both "what" to do and "how" to do it. A powerful way to encode both the "what" and the "how" is to infer a well-shaped reward function for reinforcement learning. The challenge is determining how to ground visual demonstration inputs into a well-shaped and informative reward function. We propose a technique, Rank2Reward, for learning behaviors from videos of tasks being performed, without access to any low-level states and actions. We do so by leveraging the videos to learn a reward function that measures incremental "progress" through a task, by learning how to temporally rank the video frames in a demonstration. By inferring an appropriate ranking, the reward function is able to guide reinforcement learning by indicating when task progress is being made. This ranking function can be integrated into an adversarial imitation learning scheme, resulting in an algorithm that can learn behaviors without exploiting the learned reward function. We demonstrate the effectiveness of Rank2Reward at learning behaviors from raw video on a number of tabletop manipulation tasks, both in simulation and on a real-world robotic arm. We also demonstrate how Rank2Reward can be easily extended to be applicable to web-scale video datasets.
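The temporal-ranking idea can be sketched minimally: train a scalar "progress" score so that later frames outrank earlier ones, using a Bradley-Terry style pairwise loss. The 1-D frame features and linear model below are hypothetical stand-ins for learned video embeddings, not the Rank2Reward architecture.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_progress_ranker(frames, epochs=200, lr=0.5, seed=0):
    """Fit a scalar 'progress' score r(x) = w * x so that, for any pair of
    frames in a demonstration, the later frame scores higher.  Uses a
    Bradley-Terry ranking loss on randomly sampled frame pairs.

    frames: list of scalar per-frame features, ordered by time
            (a hypothetical 1-D stand-in for video embeddings).
    """
    rng = random.Random(seed)
    w = 0.0
    n = len(frames)
    for _ in range(epochs):
        i, j = sorted(rng.sample(range(n), 2))    # frame i precedes frame j
        p = sigmoid(w * (frames[j] - frames[i]))  # P(j ranked after i)
        grad = (p - 1.0) * (frames[j] - frames[i])  # d(-log p)/dw
        w -= lr * grad
    return lambda x: w * x

# Synthetic demo: a feature that grows (noisily) with task progress.
frames = [0.1 * t + 0.01 * ((-1) ** t) for t in range(20)]
reward = train_progress_ranker(frames)
```

The learned score then serves as a shaped reward: states resembling later frames receive higher values, signaling task progress to the RL agent.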
https://arxiv.org/abs/2404.14735
The execution of flight missions by unmanned aerial vehicles (UAVs) primarily relies on navigation. In particular, the navigation pipeline has traditionally been divided into positioning and control, operating in a sequential loop. However, the existing navigation pipeline, where positioning and control are decoupled, struggles to adapt to the ubiquitous uncertainties arising from measurement noise, abrupt disturbances, and nonlinear dynamics. As a result, the navigation reliability of the UAV is significantly challenged in complex dynamic areas. For example, ubiquitous global navigation satellite system (GNSS) positioning can be degraded by signal reflections from surrounding high-rise buildings in complex urban areas, leading to significantly increased positioning uncertainty. An additional challenge is introduced to the control algorithm by the complex wind disturbances in urban canyons. Given that positioning and control are highly correlated with each other, this research proposes a **tightly joined positioning and control model (JPCM) based on factor graph optimization (FGO)**. In particular, the proposed JPCM combines sensor measurements from positioning and control constraints into a unified probabilistic factor graph. Specifically, the positioning measurements are formulated as factors in the factor graph. In addition, the model predictive control (MPC) is also formulated as additional factors in the factor graph. By solving the factor graph composed of both the positioning-related factors and the MPC-based factors, the complementarity of positioning and control can be deeply exploited. Finally, we validate the effectiveness and resilience of the proposed method using a simulated quadrotor system, which shows significantly improved trajectory-following performance.
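A heavily simplified, hypothetical 1-D analogue of the joint formulation: positioning measurements, MPC-style tracking costs, and dynamics constraints all enter one linear least-squares factor graph that is solved jointly over positions and controls. This is only a sketch of the idea, not the paper's JPCM.

```python
import numpy as np

def solve_joint_graph(meas, ref, w_meas=1.0, w_track=0.5, w_dyn=5.0):
    """Solve positioning and control jointly as one weighted linear
    least-squares factor graph (illustrative weights and model).

    Unknowns: positions x_0..x_{T-1} and controls u_0..u_{T-2}.
    Factors:
      measurement: x_t ~ meas[t]            (positioning factor)
      tracking:    x_t ~ ref[t]             (MPC-style cost factor)
      dynamics:    x_{t+1} - x_t - u_t ~ 0  (model factor)
    """
    T = len(meas)
    n = 2 * T - 1
    rows, rhs = [], []

    def add(row, r, w):
        rows.append([w * v for v in row])
        rhs.append(w * r)

    for t in range(T):
        e = [0.0] * n
        e[t] = 1.0
        add(e, meas[t], w_meas)        # measurement factor
        add(e, ref[t], w_track)        # tracking factor
    for t in range(T - 1):
        e = [0.0] * n
        e[t + 1], e[t], e[T + t] = 1.0, -1.0, -1.0
        add(e, 0.0, w_dyn)             # dynamics factor
    A, b = np.array(rows), np.array(rhs)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:T], sol[T:]            # fused positions, implied controls

x, u = solve_joint_graph(meas=[0.0, 1.2, 1.8], ref=[0.0, 1.0, 2.0])
```

Because measurements and control objectives share one graph, noisy measurements are pulled toward the reference and the controls are consistent with the fused state, which is the complementarity the abstract describes.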
https://arxiv.org/abs/2404.14724
Testing and evaluating the safety performance of autonomous vehicles (AVs) is essential before large-scale deployment. In practice, the number of testing scenarios permissible for a specific AV is severely limited by tight constraints on testing budgets and time. Under such strict limits on the number of tests, existing testing methods often lead to significant uncertainty in the evaluation results, or difficulty in quantifying them. In this paper, we formulate this problem for the first time as the "few-shot testing" (FST) problem and propose a systematic framework to address this challenge. To alleviate the considerable uncertainty inherent in a small testing scenario set, we frame the FST problem as an optimization problem and search for the testing scenario set based on neighborhood coverage and similarity. Specifically, guided by the testing scenario set's generalization ability across AVs, we dynamically adjust this set and the contribution of each testing scenario to the evaluation result based on coverage, leveraging the prior information of surrogate models (SMs). Under certain hypotheses on the SMs, a theoretical upper bound on the evaluation error is established to verify the sufficiency of evaluation accuracy within the given limited number of tests. Experimental results on cut-in scenarios demonstrate a notable reduction in the evaluation error and variance of our method compared to conventional testing methods, especially for situations with a strict limit on the number of scenarios.
https://arxiv.org/abs/2402.01795
In multi-robot systems, achieving coordinated missions remains a significant challenge due to the coupled nature of coordination behaviors and the lack of global information for individual robots. To mitigate these challenges, this paper introduces a novel approach, Bi-level Coordination Learning (Bi-CL), that leverages a bi-level optimization structure within a centralized training and decentralized execution paradigm. Our bi-level reformulation decomposes the original problem into a reinforcement learning level with reduced action space, and an imitation learning level that gains demonstrations from a global optimizer. Both levels contribute to improved learning efficiency and scalability. We note that robots' incomplete information leads to mismatches between the two levels of learning models. To address this, Bi-CL further integrates an alignment penalty mechanism, aiming to minimize the discrepancy between the two levels without degrading their training efficiency. We introduce a running example to conceptualize the problem formulation and apply Bi-CL to two variations of this example: route-based and graph-based scenarios. Simulation results demonstrate that Bi-CL can learn more efficiently and achieve comparable performance with traditional multi-agent reinforcement learning baselines for multi-robot coordination.
https://arxiv.org/abs/2404.14649
Human behavior modeling is important for the design and implementation of human-automation interactive control systems. In this context, human behavior refers to a human's control input to systems. We propose a novel method for human behavior modeling that uses human demonstrations for a given task to infer the unknown task objective and the variability. The task objective represents the human's intent or desire. It can be inferred via inverse optimal control, improving the understanding of human behavior by providing an explainable objective function behind the observed behavior. Meanwhile, the variability denotes the intrinsic uncertainty in human behavior. It can be described by a Gaussian mixture model, capturing the uncertainty in human behavior that cannot be encoded by the task objective. The proposed method improves the prediction accuracy of human behavior by leveraging both the task objective and the variability, and is demonstrated through human-subject experiments using an illustrative quadrotor remote control example.
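As a sketch of the variability model, a minimal EM fit of a 1-D Gaussian mixture is shown below. The data, component count, and initialization are illustrative assumptions, not the paper's actual human-control model.

```python
import math

def fit_gmm_1d(data, k=2, iters=100):
    """Minimal EM for a 1-D Gaussian mixture: returns weights, means,
    and standard deviations of the k components."""
    n = len(data)
    srt = sorted(data)
    # Crude init: means at evenly spaced quantiles, broad shared sigma.
    mus = [srt[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    sig = [(max(srt) - min(srt)) or 1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities (unnormalized Gaussian densities).
        resp = []
        for x in data:
            ws = [pi[j] * math.exp(-0.5 * ((x - mus[j]) / sig[j]) ** 2) / sig[j]
                  for j in range(k)]
            s = sum(ws)
            resp.append([w / s for w in ws])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sig[j] = max(math.sqrt(var), 1e-6)
    return pi, mus, sig

# Hypothetical control inputs clustered around two typical responses.
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
pi, mus, sig = fit_gmm_1d(data)
```

Here the residual spread around each typical input plays the role of the intrinsic variability that the task objective alone cannot encode.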
https://arxiv.org/abs/2404.14647
Finding controllers that perform well across multiple morphologies is an important milestone for large-scale robotics, in line with recent advances via foundation models in other areas of machine learning. However, the challenges of learning a single controller to control multiple morphologies make the `one robot one task' paradigm dominant in the field. To alleviate these challenges, we present a pipeline that: (1) leverages Quality Diversity algorithms like MAP-Elites to create a dataset of many single-task/single-morphology teacher controllers, then (2) distills those diverse controllers into a single multi-morphology controller that performs well across many different body plans by mimicking the sensory-action patterns of the teacher controllers via supervised learning. The distilled controller scales well with the number of teachers/morphologies and shows emergent properties. It generalizes to unseen morphologies in a zero-shot manner, providing robustness to morphological perturbations and instant damage recovery. Lastly, the distilled controller is also independent of the teacher controllers -- we can distill the teacher's knowledge into any controller model, making our approach synergistic with architectural improvements and existing training algorithms for teacher controllers.
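The distillation step can be illustrated with a hypothetical toy: several single-morphology teacher controllers are behavior-cloned into one morphology-conditioned student by least squares, which then extrapolates zero-shot to an unseen morphology. The linear controllers and features below are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def distill(teachers, morphs, states):
    """Distill single-morphology teacher controllers into one
    morphology-conditioned student via least-squares behavior cloning.

    Toy model: teachers are scalar state-feedback laws; the student is
    linear in the features [state, morph * state].
    """
    X, y = [], []
    for teacher, m in zip(teachers, morphs):
        for s in states:
            X.append([s, m * s])
            y.append(teacher(s))
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return lambda s, m: w[0] * s + w[1] * m * s

# Each morphology m uses gain -2*m (a stand-in for per-morphology
# MAP-Elites teachers); the student recovers the pattern and can be
# queried at an unseen morphology (e.g. m = 2.0) zero-shot.
morphs = [0.5, 1.0, 1.5]
teachers = [lambda s, m=m: -2.0 * m * s for m in morphs]
student = distill(teachers, morphs, states=[-1.0, -0.5, 0.5, 1.0])
```

The student model class is independent of the teachers, mirroring the abstract's point that the teachers' knowledge can be distilled into any controller architecture.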
https://arxiv.org/abs/2404.14625
Controller tuning and parameter optimization are crucial in system design to improve closed-loop system performance. Bayesian optimization has been established as an efficient model-free controller tuning and adaptation method. However, Bayesian optimization methods are computationally expensive and therefore difficult to use in real-time critical scenarios. In this work, we propose a real-time purely data-driven, model-free approach for adaptive control, by online tuning low-level controller parameters. We base our algorithm on GoOSE, an algorithm for safe and sample-efficient Bayesian optimization, for handling performance and stability criteria. We introduce multiple computational and algorithmic modifications for computational efficiency and parallelization of optimization steps. We further evaluate the algorithm's performance on a real precision-motion system utilized in semiconductor industry applications by modifying the payload and reference stepsize and comparing it to an interpolated constrained optimization-based baseline approach.
https://arxiv.org/abs/2404.14602
Autonomous robots navigating in changing environments demand adaptive navigation strategies for safe long-term operation. While many modern control paradigms offer theoretical guarantees, they often assume known extrinsic safety constraints, overlooking challenges when deployed in real-world environments where objects can appear, disappear, and shift over time. In this paper, we present a closed-loop perception-action pipeline that bridges this gap. Our system encodes an online-constructed dense map, along with object-level semantic and consistency estimates into a control barrier function (CBF) to regulate safe regions in the scene. A model predictive controller (MPC) leverages the CBF-based safety constraints to adapt its navigation behaviour, which is particularly crucial when potential scene changes occur. We test the system in simulations and real-world experiments to demonstrate the impact of semantic information and scene change handling on robot behavior, validating the practicality of our approach.
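A minimal, hypothetical 1-D example of how a control barrier function regulates a safe region: a safety filter clips the desired command so the discrete CBF condition holds, keeping the state inside the safe set. The paper couples the CBF with an MPC over an online semantic map; this toy only shows the barrier mechanism itself.

```python
def cbf_filter(x, u_des, x_min=0.0, alpha=0.5):
    """Safety filter for the 1-D single integrator x_{k+1} = x + dt*u with
    safe set h(x) = x - x_min >= 0.

    Enforcing u >= -alpha * h(x) yields h(x_{k+1}) >= (1 - alpha*dt) * h(x_k),
    so h never becomes negative (for alpha * dt <= 1).
    """
    h = x - x_min
    return max(u_des, -alpha * h)

# The desired command pushes toward the unsafe boundary; the filtered
# command lets the state decay toward x_min = 0 without crossing it.
x, dt = 1.0, 0.1
for _ in range(50):
    x += dt * cbf_filter(x, u_des=-2.0)
```

In the paper's setting, h would additionally encode map occupancy and semantic consistency estimates, so the safe set itself adapts as the scene changes.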
https://arxiv.org/abs/2404.14546
Stable locomotion in precipitous environments is an essential capability of quadruped robots, demanding the ability to resist various external disturbances. However, recent learning-based policies only use basic domain randomization to improve the robustness of learned policies, which cannot guarantee that the robot has adequate disturbance resistance capabilities. In this paper, we propose to model the learning process as an adversarial interaction between the actor and a newly introduced disturber, and to ensure their optimization with an $H_{\infty}$ constraint. In contrast to the actor, which maximizes the discounted overall reward, the disturber is responsible for generating effective external forces and is optimized by maximizing the error between the task reward and its oracle, i.e., the "cost", in each iteration. To keep the joint optimization between the actor and the disturber stable, our $H_{\infty}$ constraint bounds the ratio between the cost and the intensity of the external forces. Through reciprocal interaction throughout the training phase, the actor can acquire the capability to handle increasingly complex physical disturbances. We verify the robustness of our approach on quadrupedal locomotion tasks with the Unitree Aliengo robot, and also on a more challenging task with the Unitree A1 robot, where the quadruped is expected to perform locomotion merely on its hind legs as if it were a bipedal robot. The simulated quantitative results show improvement over baselines, demonstrating the effectiveness of the method and of each design choice. Real-robot experiments, on the other hand, qualitatively exhibit how robust the policy is when subjected to various disturbances on various terrains, including stairs, high platforms, slopes, and slippery terrains. All code, checkpoints, and real-world deployment guidance will be made public.
https://arxiv.org/abs/2404.14405
We present PLUTO, a powerful framework that pushes the limit of imitation learning-based planning for autonomous driving. Our improvements stem from three pivotal aspects: a longitudinal-lateral aware model architecture that enables flexible and diverse driving behaviors; an innovative auxiliary loss computation method that is broadly applicable and efficient for batch-wise calculation; and a novel training framework that leverages contrastive learning, augmented by a suite of new data augmentations, to regulate driving behaviors and facilitate the understanding of underlying interactions. We assessed our framework using the large-scale real-world nuPlan dataset and its associated standardized planning benchmark. Impressively, PLUTO achieves state-of-the-art closed-loop performance, beating other competing learning-based methods and surpassing the current top-performing rule-based planner for the first time. Results and code are available at this https URL.
https://arxiv.org/abs/2404.14327