Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at this https URL
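To make the pretraining objective concrete, here is a minimal sketch of a DynaMo-style training step under assumed shapes and module choices (the encoder stand-in, dimensions, and the stop-gradient on the target embedding are illustrative, not the authors' implementation): an encoder maps frames to embeddings, an inverse dynamics model infers a latent transition code from consecutive embeddings, and a forward model predicts the next embedding from the current one plus that code.

```python
# Minimal sketch of a DynaMo-style pretraining step (illustrative names/shapes,
# not the authors' code): image encoder, latent inverse dynamics model, and a
# forward model trained purely on next-frame prediction in latent space.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, emb_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a ResNet/ViT image encoder
            nn.Flatten(), nn.Linear(3 * 64 * 64, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim))
        self.inverse = nn.Sequential(  # (z_t, z_{t+1}) -> latent "action" code
            nn.Linear(2 * emb_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.forward_model = nn.Sequential(  # (z_t, a_t) -> predicted z_{t+1}
            nn.Linear(emb_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def loss(self, obs_t, obs_tp1):
        z_t, z_tp1 = self.encoder(obs_t), self.encoder(obs_tp1)
        a_t = self.inverse(torch.cat([z_t, z_tp1], dim=-1))
        z_pred = self.forward_model(torch.cat([z_t, a_t], dim=-1))
        # next-frame prediction in latent space: no augmentations, no contrastive
        # pairs, no ground-truth actions; the detach is one common (assumed) way
        # to discourage representation collapse.
        return nn.functional.mse_loss(z_pred, z_tp1.detach())

model = LatentDynamics()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
obs_t, obs_tp1 = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)  # dummy frames
loss = model.loss(obs_t, obs_tp1)
opt.zero_grad(); loss.backward(); opt.step()
```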
https://arxiv.org/abs/2409.12192
Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA frameworks, such as GTSAM, g$^2$o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ compared to GTSAM, g$^2$o, and Ceres, respectively.
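As a rough illustration of what eager-mode, differentiable bundle adjustment buys, the sketch below holds poses and landmarks as ordinary PyTorch tensors and refines them directly against reprojection error; this is plain PyTorch with a first-order optimizer, not the PyPose interface or the paper's second-order sparse solvers.

```python
# Illustrative eager-mode bundle adjustment in plain PyTorch (not the PyPose API):
# camera pose (axis-angle + translation) and 3D landmarks are ordinary tensors,
# the reprojection error is differentiable, and any torch optimizer can refine it.
import torch

def skew(w):                       # so(3) hat operator
    wx, wy, wz = w.unbind(-1)
    O = torch.zeros_like(wx)
    return torch.stack([torch.stack([O, -wz, wy], -1),
                        torch.stack([wz, O, -wx], -1),
                        torch.stack([-wy, wx, O], -1)], -2)

def project(points, rotvec, t, K):
    R = torch.matrix_exp(skew(rotvec))          # exponential map onto SO(3)
    pc = points @ R.T + t                       # world -> camera
    uv = pc @ K.T
    return uv[:, :2] / uv[:, 2:3]               # perspective division

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
points = torch.randn(100, 3) + torch.tensor([0., 0., 5.])    # synthetic landmarks
obs = project(points, torch.zeros(3), torch.zeros(3), K)      # "observed" pixels

# Noisy initial estimates, optimized jointly. A real BA would use a sparse
# second-order solver; Adam is used here only to keep the sketch short.
rotvec = torch.full((3,), 0.05, requires_grad=True)
trans = torch.full((3,), 0.10, requires_grad=True)
est_points = (points + 0.05 * torch.randn_like(points)).clone().requires_grad_(True)
opt = torch.optim.Adam([rotvec, trans, est_points], lr=1e-2)
for _ in range(200):
    resid = project(est_points, rotvec, trans, K) - obs
    loss = (resid ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```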
https://arxiv.org/abs/2409.12190
There is a large population of wheelchair users, and most of them need help with daily tasks. However, according to recent reports, their needs are not properly satisfied due to the lack of caregivers. Therefore, in this project, we develop WeHelp, a shared autonomy system aimed at wheelchair users. A robot with the WeHelp system has three modes: following mode, remote control mode, and teleoperation mode. In the following mode, the robot follows the wheelchair user automatically via visual tracking. The wheelchair user can ask the robot to follow them from behind, on the left, or on the right. When the wheelchair user asks for help, the robot recognizes the command via speech recognition and then switches to the teleoperation mode or remote control mode. In the teleoperation mode, the wheelchair user takes over the robot with a joystick and controls the robot to complete some complex tasks for their needs, such as opening doors, moving obstacles out of the way, or reaching objects on a high shelf or on the low ground. In the remote control mode, a remote assistant takes over the robot and helps the wheelchair user complete some complex tasks for their needs. Our evaluation shows that the pipeline is useful and practical for wheelchair users. Source code and a demo of the paper are available at \url{this https URL}.
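A toy sketch of the mode-switching logic described above follows; the trigger phrases and mode names are assumptions for illustration, not the WeHelp implementation.

```python
# Toy sketch of the three-mode switching logic (command phrases and mode names
# are illustrative assumptions, not taken from the WeHelp implementation).
FOLLOWING, TELEOPERATION, REMOTE_CONTROL = "following", "teleoperation", "remote_control"

class ModeManager:
    def __init__(self):
        self.mode = FOLLOWING          # default: visually track the wheelchair
        self.follow_side = "behind"    # "behind", "left", or "right"

    def on_speech(self, command: str):
        command = command.lower()
        if "follow" in command:
            self.mode = FOLLOWING
            for side in ("left", "right", "behind"):
                if side in command:
                    self.follow_side = side
        elif "joystick" in command or "i will control" in command:
            self.mode = TELEOPERATION      # user drives the robot directly
        elif "help" in command or "assistant" in command:
            self.mode = REMOTE_CONTROL     # hand over to a remote assistant
        return self.mode

mgr = ModeManager()
print(mgr.on_speech("Please follow me on the left"))   # following
print(mgr.on_speech("I need help from an assistant"))  # remote_control
```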
https://arxiv.org/abs/2409.12159
Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.
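One ingredient of the belief-space reasoning can be illustrated with a simple Bayesian update over candidate human goals, where the human-action likelihood is conditioned on the robot's planned motion; the goal set, likelihood form, and rationality parameter below are assumptions for illustration, not the paper's model.

```python
# Illustrative Bayesian belief update over the human's goal, conditioned on the
# robot's plan, in the spirit of the joint physical-and-belief state described
# above (goal set, likelihood model, and parameters are assumptions).
import numpy as np

goals = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # candidate human goals

def human_likelihood(human_pos, human_vel, robot_plan_next, goal, beta=3.0):
    # Goal-driven human, softly repelled by where the robot plans to be next.
    desired = (goal - human_pos) - 0.5 * (robot_plan_next - human_pos)
    desired /= (np.linalg.norm(desired) + 1e-9)
    return np.exp(beta * desired @ human_vel)             # higher if motion matches

def update_belief(belief, human_pos, human_vel, robot_plan_next):
    lik = np.array([human_likelihood(human_pos, human_vel, robot_plan_next, g)
                    for g in goals])
    post = belief * lik
    return post / post.sum()

belief = np.ones(len(goals)) / len(goals)                 # uniform prior
belief = update_belief(belief, human_pos=np.array([0.0, 0.0]),
                       human_vel=np.array([0.7, 0.7]),
                       robot_plan_next=np.array([0.5, -0.5]))
print(belief)   # mass shifts toward the goal consistent with the observed motion
```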
https://arxiv.org/abs/2409.12153
Teams of mobile [aerial, ground, or aquatic] robots have applications in resource delivery, patrolling, information-gathering, agriculture, forest fire fighting, chemical plume source localization and mapping, and search-and-rescue. Robot teams traversing hazardous environments -- with e.g. rough terrain or seas, strong winds, or adversaries capable of attacking or capturing robots -- should plan and coordinate their trails in consideration of risks of disablement, destruction, or capture. Specifically, the robots should take the safest trails, coordinate their trails to cooperatively achieve the team-level objective with robustness to robot failures, and balance the reward from visiting locations against risks of robot losses. Herein, we consider bi-objective trail-planning for a mobile team of robots orienteering in a hazardous environment. The hazardous environment is abstracted as a directed graph whose arcs, when traversed by a robot, present known probabilities of survival. Each node of the graph offers a reward to the team if visited by a robot (which e.g. delivers a good to or images the node). We wish to search for the Pareto-optimal robot-team trail plans that maximize two [conflicting] team objectives: the expected (i) team reward and (ii) number of robots that survive the mission. A human decision-maker can then select trail plans that balance, according to their values, reward and robot survival. We implement ant colony optimization, guided by heuristics, to search for the Pareto-optimal set of robot team trail plans. As a case study, we illustrate with an information-gathering mission in an art museum.
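A stripped-down, single-robot sketch of the search conveys the two objectives and the ant-colony mechanics; the graph, heuristic weighting, and pheromone update below are illustrative rather than the paper's exact scheme.

```python
# Simplified single-robot sketch of ant-colony search over a hazardous directed
# graph: arcs carry survival probabilities, nodes carry rewards, and a Pareto
# archive keeps trails trading off expected reward vs. survival probability.
import random

arcs = {           # (u, v): survival probability when traversing u -> v
    ("s", "a"): 0.95, ("s", "b"): 0.70, ("a", "b"): 0.90,
    ("a", "t"): 0.85, ("b", "t"): 0.95,
}
reward = {"s": 0.0, "a": 2.0, "b": 5.0, "t": 1.0}
out = {}
for (u, v) in arcs:
    out.setdefault(u, []).append(v)

pheromone = {arc: 1.0 for arc in arcs}

def evaluate(trail):
    surv, exp_reward = 1.0, reward[trail[0]]
    for u, v in zip(trail, trail[1:]):
        surv *= arcs[(u, v)]
        exp_reward += surv * reward[v]        # reward counted only if still alive
    return exp_reward, surv                   # two objectives to maximize

def sample_trail(alpha=1.0, beta=2.0):
    trail = ["s"]
    while trail[-1] != "t":
        u = trail[-1]
        cands = [v for v in out.get(u, []) if v not in trail]
        if not cands:
            return None
        weights = [pheromone[(u, v)] ** alpha * arcs[(u, v)] ** beta for v in cands]
        trail.append(random.choices(cands, weights)[0])
    return trail

def dominates(p, q):
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

pareto = {}                                   # trail tuple -> objective pair
for _ in range(300):                          # ant iterations
    trail = sample_trail()
    if trail is None:
        continue
    obj = evaluate(trail)
    if not any(dominates(o, obj) for o in pareto.values()):
        pareto = {t: o for t, o in pareto.items() if not dominates(obj, o)}
        pareto[tuple(trail)] = obj
        for arc in zip(trail, trail[1:]):     # reinforce arcs of non-dominated trails
            pheromone[arc] += obj[0]
    for arc in pheromone:                     # evaporation
        pheromone[arc] *= 0.98

for trail, (r, s) in pareto.items():
    print(trail, f"expected reward={r:.2f}", f"survival={s:.2f}")
```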
https://arxiv.org/abs/2409.12114
Efficiently and completely capturing the three-dimensional data of an object is a fundamental problem in industrial and robotic applications. The task of next-best-view (NBV) planning is to infer the pose of the next viewpoint based on the current data, and gradually realize the complete three-dimensional reconstruction. Many existing algorithms, however, suffer from a large computational burden due to the use of ray-casting. To address this, this paper proposes a projection-based NBV planning framework. It can select the next best view at an extremely fast speed while ensuring the complete scanning of the object. Specifically, this framework refits different types of voxel clusters into ellipsoids based on the voxel structure. Then, the next best view is selected from the candidate views using a projection-based viewpoint quality evaluation function in conjunction with a global partitioning strategy. This process replaces the ray-casting in voxel structures, significantly improving the computational efficiency. Comparative experiments with other algorithms in a simulation environment show that the framework proposed in this paper achieves a roughly tenfold efficiency improvement while capturing roughly the same coverage. The real-world experimental results also prove the efficiency and feasibility of the framework.
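The refitting step can be illustrated by fitting an ellipsoid to a voxel cluster from its mean and covariance and scoring it against a candidate view by projection; the axis scaling and the projected-area heuristic below are assumptions, not the paper's evaluation function.

```python
# Sketch of the refitting idea: approximate a voxel cluster by an ellipsoid built
# from the cluster's mean and covariance, then score a candidate view by the
# ellipsoid's projected cross-section instead of ray-casting.
import numpy as np

def fit_ellipsoid(voxel_centers, scale=2.0):
    """Return (center, axes as columns, semi-axis lengths) for a voxel cluster."""
    pts = np.asarray(voxel_centers, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov(pts.T) + 1e-9 * np.eye(3)
    eigval, eigvec = np.linalg.eigh(cov)          # principal axes of the cluster
    radii = scale * np.sqrt(eigval)               # ~2-sigma extent along each axis
    return center, eigvec, radii

def projected_area(radii, axes, view_dir):
    """Rough visible cross-section of the ellipsoid seen from view_dir."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    proj = []
    for i in range(3):                            # project scaled axes onto the image plane
        a = radii[i] * axes[:, i]
        proj.append(a - (a @ view_dir) * view_dir)
    lengths = sorted(np.linalg.norm(p) for p in proj)[-2:]   # two largest projected axes
    return np.pi * lengths[0] * lengths[1]        # ellipse-area approximation

cluster = np.random.randn(200, 3) * np.array([0.4, 0.1, 0.05])   # synthetic voxel centers
c, axes, radii = fit_ellipsoid(cluster)
print(projected_area(radii, axes, view_dir=np.array([0.0, 0.0, 1.0])))
```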
https://arxiv.org/abs/2409.12096
Sonar-based indoor mapping systems have been widely employed in robotics for several decades. While such systems are still the mainstream in underwater and pipe inspection settings, their vulnerability to noise reduced, over time, their general widespread usage in favour of other modalities (\textit{e.g.}, cameras, lidars), whose technologies were meanwhile advancing extraordinarily. Nevertheless, mapping physical environments using acoustic signals and echolocation can bring significant benefits to robot navigation in adverse scenarios, thanks to their complementary characteristics compared to other sensors. Cameras and lidars, indeed, struggle in harsh weather conditions, when dealing with a lack of illumination, or with non-reflective walls. Yet, for acoustic sensors to be able to generate accurate maps, noise has to be properly and effectively handled. Traditional signal processing techniques are not always a solution in those cases. In this paper, we propose a framework where machine learning is exploited to aid more traditional signal processing methods to cope with background noise, by removing outliers and artefacts from the maps generated with acoustic sensors. Our goal is to demonstrate that the performance of traditional echolocation mapping techniques can be greatly enhanced, even in particularly noisy conditions, facilitating the employment of acoustic sensors in state-of-the-art multi-modal robot navigation systems. Our simulated evaluation demonstrates that the system can reliably operate at an SNR of $-10$dB. Moreover, we also show that the proposed method is capable of operating in different reverberant environments. In this paper, we also use the proposed method to map the outline of a simulated room using a robotic platform.
https://arxiv.org/abs/2409.12094
Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food presents unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lack adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing methods employ IL or Reinforcement Learning (RL) to learn a policy based on off-the-shelf image encoders such as ResNet-50. However, such representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot's capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach's robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves an improvement of up to $35\%$ in success rate compared with the best-performing baseline.
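The sketch below illustrates one plausible way the four representation streams could feed a single acquisition policy; the encoders, dimensions, and concatenation scheme are assumptions for illustration, not the IMRL architecture.

```python
# Minimal sketch of fusing visual, physical, temporal, and geometric cues into a
# conditioning vector for an imitation policy (all sizes/modules are illustrative).
import torch
import torch.nn as nn

class FusedAcquisitionPolicy(nn.Module):
    def __init__(self, vis_dim=512, phys_classes=5, temp_dim=64, geo_dim=8, act_dim=7):
        super().__init__()
        self.phys_embed = nn.Embedding(phys_classes, 32)   # solid/semi-solid/granular/liquid/mixture
        self.temporal = nn.GRU(vis_dim, temp_dim, batch_first=True)  # dynamics of recent frames
        self.head = nn.Sequential(
            nn.Linear(vis_dim + 32 + temp_dim + geo_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim))                        # e.g., end-effector scooping action

    def forward(self, vis_seq, phys_id, geo_feat):
        # vis_seq: (B, T, vis_dim) visual embeddings; geo_feat: scooping point + fullness cues
        _, h = self.temporal(vis_seq)
        fused = torch.cat([vis_seq[:, -1], self.phys_embed(phys_id), h[-1], geo_feat], dim=-1)
        return self.head(fused)

policy = FusedAcquisitionPolicy()
action = policy(torch.randn(2, 10, 512), torch.tensor([1, 3]), torch.randn(2, 8))
print(action.shape)   # (2, 7)
```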
https://arxiv.org/abs/2409.12092
This paper presents a general refractive camera model and online co-estimation of odometry and the refractive index of unknown media. This enables operation in diverse and varying refractive fluids, given only the camera calibration in air. The refractive index is estimated online as a state variable of a monocular visual-inertial odometry framework in an iterative formulation using the proposed camera model. The method was verified on data collected using an underwater robot traversing inside a pool. The evaluations demonstrate convergence to the ideal refractive index for water despite significant perturbations in the initialization. Simultaneously, the approach enables on-par visual-inertial odometry performance in refractive media without prior knowledge of the refractive index or requirement of medium-specific camera calibration.
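The effect being modeled can be seen in a small worked example of Snell's law at a flat air-to-medium interface, with the refractive index as the free parameter; the flat-port geometry here is assumed purely for illustration and does not reproduce the paper's general camera model or estimator.

```python
# Worked sketch of the refraction effect: a ray leaving the camera through a flat
# port bends according to Snell's law, with the medium's refractive index n as a
# parameter (flat-port geometry assumed for illustration only).
import numpy as np

def refract(direction, normal, n_air=1.0, n_medium=1.33):
    """Refract a unit ray crossing from air into a medium across a flat interface."""
    d = direction / np.linalg.norm(direction)
    n = normal / np.linalg.norm(normal)          # surface normal opposing the incident ray
    eta = n_air / n_medium
    cos_i = -d @ n
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None                              # total internal reflection (not air -> water)
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

# A ray 20 degrees off the port normal bends toward the normal in water (n ~ 1.33).
theta = np.deg2rad(20.0)
ray = np.array([np.sin(theta), 0.0, np.cos(theta)])
bent = refract(ray, normal=np.array([0.0, 0.0, -1.0]))
print(np.rad2deg(np.arccos(bent[2])))   # ~14.9 degrees: apparent geometry changes with n
```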
https://arxiv.org/abs/2409.12074
Imitation-based robot learning has recently gained significant attention in the robotics field due to its theoretical potential for transferability and generalizability. However, it remains notoriously costly, both in terms of hardware and data collection, and deploying it in real-world environments demands meticulous setup of robots and precise experimental conditions. In this paper, we present a low-cost robot learning framework that is both easily reproducible and transferable to various robots and environments. We demonstrate that deployable imitation learning can be successfully applied even to industrial-grade robots, not just expensive collaborative robotic arms. Furthermore, our results show that multi-task robot learning is achievable with simple network architectures and fewer demonstrations than previously thought necessary. As current evaluation methods are largely subjective when it comes to real-world manipulation tasks, we propose Voting Positive Rate (VPR) - a novel evaluation strategy that provides a more objective assessment of performance. We conduct an extensive comparison of success rates across various self-designed tasks to validate our approach. To foster collaboration and support the robot learning community, we have open-sourced all relevant datasets and model checkpoints, available at this http URL.
https://arxiv.org/abs/2409.12061
We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Hereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only from the robot's stereo rig, but we further probabilistically fuse motion stereo that provides depth information across a range of baselines, therefore drastically increasing mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic nonlinear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated in two benchmark datasets, resulting in localization and mapping accuracy that exceeds the state of the art, while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real-time.
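The uncertainty-aware fusion can be illustrated with inverse-variance weighting of two depth estimates and a confidence-weighted occupancy update; the 1-D inverse-sensor model and the numbers below are illustrative only, not the paper's formulation.

```python
# Sketch of uncertainty-aware depth fusion: two depth estimates with predicted
# standard deviations are fused by inverse-variance weighting, and the fused
# uncertainty is propagated into an occupancy log-odds update along the ray.
import numpy as np

def fuse_depth(d1, sigma1, d2, sigma2):
    w1, w2 = 1.0 / sigma1 ** 2, 1.0 / sigma2 ** 2
    d = (w1 * d1 + w2 * d2) / (w1 + w2)
    sigma = np.sqrt(1.0 / (w1 + w2))          # fused estimate is never less certain
    return d, sigma

def occupancy_logodds_update(cell_depth, d, sigma, l_free=-0.4, l_occ=0.85):
    """Confidence-weighted inverse-sensor model for a single cell along the ray."""
    p_hit = np.exp(-0.5 * ((cell_depth - d) / sigma) ** 2)   # chance the surface is here
    if cell_depth < d - 2 * sigma:
        return l_free                          # well in front of the surface: free space
    return p_hit * l_occ                       # near the fused depth: occupied, scaled by confidence

# Network stereo says 2.0 m +/- 0.20 m; motion stereo (wider baseline) says 1.9 m +/- 0.05 m.
d, sigma = fuse_depth(2.0, 0.20, 1.9, 0.05)
print(round(d, 3), round(sigma, 3))            # ~1.906 m +/- 0.049 m: dominated by the sharper cue
print(occupancy_logodds_update(1.0, d, sigma)) # free-space update
print(occupancy_logodds_update(1.9, d, sigma)) # strong occupied update
```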
https://arxiv.org/abs/2409.12051
Safety is one of the key issues preventing the deployment of reinforcement learning techniques in real-world robots. While most approaches in the Safe Reinforcement Learning area do not require prior knowledge of constraints and robot kinematics and rely solely on data, it is often difficult to deploy them in complex real-world settings. Instead, model-based approaches that incorporate prior knowledge of the constraints and dynamics into the learning framework have proven capable of deploying the learning algorithm directly on the real robot. Unfortunately, while an approximated model of the robot dynamics is often available, the safety constraints are task-specific and hard to obtain: they may be too complicated to encode analytically, too expensive to compute, or it may be difficult to envision a priori the long-term safety requirements. In this paper, we bridge this gap by extending the safe exploration method, ATACOM, with learnable constraints, with a particular focus on ensuring long-term safety and handling of uncertainty. Our approach is competitive or superior to state-of-the-art methods in final performance while maintaining safer behavior during training.
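A bare-bones sketch of acting with a learned constraint is shown below: a small network represents c(s), autograd supplies its Jacobian, and a nominal command is projected onto the constraint's null space with an error-correction term. This is a simplified ATACOM-style illustration under assumed dimensions; it omits the paper's slack variables, dynamics handling, long-term safety, and uncertainty treatment.

```python
# Simplified sketch of acting on the safe side of a *learned* constraint c(s) <= 0:
# the constraint Jacobian (via autograd) gives the directions that change c, and a
# nominal velocity command is projected onto its null space plus a pull-back term.
import torch
import torch.nn as nn

constraint_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def safe_velocity(state, nominal_vel, gain=2.0):
    state = state.clone().requires_grad_(True)
    c = constraint_net(state)                          # learned constraint value c(s)
    (J,) = torch.autograd.grad(c.sum(), state)         # dc/ds, shape (4,)
    J = J.unsqueeze(0)                                 # (1, 4)
    J_pinv = torch.linalg.pinv(J)                      # (4, 1)
    null_proj = torch.eye(4) - J_pinv @ J              # motions that keep c(s) constant
    # Follow the task command inside the null space; push back if c(s) > 0.
    correction = -gain * (J_pinv @ torch.clamp(c.detach(), min=0.0))
    return null_proj @ nominal_vel + correction

state = torch.tensor([0.3, -0.1, 0.2, 0.0])
vel = safe_velocity(state, nominal_vel=torch.tensor([1.0, 0.0, 0.0, 0.0]))
print(vel)
```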
https://arxiv.org/abs/2409.12045
Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at this https URL.
https://arxiv.org/abs/2409.12008
Online planning of collision-free trajectories is a fundamental task for robotics and self-driving car applications. This paper revisits collision avoidance between ellipsoidal objects using differentiable constraints. Two ellipsoids do not overlap if and only if the endpoint of the vector between the center points of the ellipsoids does not lie in the interior of the Minkowski sum of the ellipsoids. This condition is formulated using a parametric over-approximation of the Minkowski sum, which can be made tight in any given direction. The resulting collision avoidance constraint is included in an optimal control problem (OCP) and evaluated in comparison to the separating-hyperplane approach. Not only do we observe that the Minkowski-sum formulation is computationally more efficient in our experiments, but also that using pre-determined over-approximation parameters based on warm-start trajectories leads to a very limited increase in suboptimality. This gives rise to a novel real-time scheme for collision-free motion planning with model predictive control (MPC). Both the real-time feasibility and the effectiveness of the constraint formulation are demonstrated in challenging real-world experiments.
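The constraint can be made concrete with the textbook ellipsoidal-calculus over-approximation of the Minkowski sum; the parametrization below follows that standard form and may differ in detail from the paper's formulation.

```python
# Worked sketch: two ellipsoids {x : (x - c_i)^T P_i^{-1} (x - c_i) <= 1} are
# separated if the center offset lies outside the Minkowski sum of their shapes.
# A standard parametric over-approximation of that sum,
#   Q(lam) = (1 + 1/lam) P1 + (1 + lam) P2,  lam > 0,
# is tight along a chosen direction l for lam = sqrt(l^T P1 l / l^T P2 l); using it
# gives a conservative, differentiable constraint (c2 - c1)^T Q^{-1} (c2 - c1) >= 1.
import numpy as np

def minkowski_overapprox(P1, P2, lam):
    return (1.0 + 1.0 / lam) * P1 + (1.0 + lam) * P2

def tight_lambda(P1, P2, direction):
    l = direction / np.linalg.norm(direction)
    return np.sqrt((l @ P1 @ l) / (l @ P2 @ l))

def separation_value(c1, P1, c2, P2):
    d = c2 - c1
    lam = tight_lambda(P1, P2, d)              # tighten along the center-offset direction
    Q = minkowski_overapprox(P1, P2, lam)
    return d @ np.linalg.solve(Q, d)           # >= 1  =>  guaranteed no overlap

P_ego = np.diag([4.0, 1.0])                    # car-like footprint (semi-axes 2 m x 1 m)
P_obs = np.diag([1.0, 1.0])                    # round obstacle of radius 1 m
print(separation_value(np.array([0.0, 0.0]), P_ego, np.array([3.5, 0.0]), P_obs))  # > 1: safe
print(separation_value(np.array([0.0, 0.0]), P_ego, np.array([2.5, 0.0]), P_obs))  # < 1: too close
```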
https://arxiv.org/abs/2409.12007
Object manipulation capabilities are essential skills that set apart embodied agents engaging with the world, especially in the realm of robotics. The ability to predict outcomes of interactions with objects is paramount in this setting. While model-based control methods have started to be employed for tackling manipulation tasks, they have faced challenges in accurately manipulating objects. Analyzing the causes of this limitation, we trace the underperformance to the way current world models represent crucial positional information, especially about the target's goal specification for object positioning tasks. We introduce a general approach that empowers world model-based agents to effectively solve object-positioning tasks. We propose two declinations of this approach for generative world models: position-conditioned (PCP) and latent-conditioned (LCP) policy learning. In particular, LCP employs object-centric latent representations that explicitly capture object positional information for goal specification. This naturally leads to the emergence of multimodal capabilities, enabling the specification of goals through spatial coordinates or a visual goal. Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model-based control approaches.
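A minimal sketch contrasting the two goal-conditioning routes is given below; module names and sizes are illustrative assumptions, not the paper's architecture.

```python
# Sketch of the two conditioning routes: a position-conditioned policy consumes
# goal coordinates directly, while a latent-conditioned policy first maps either
# coordinates or a goal image into a shared goal latent.
import torch
import torch.nn as nn

class GoalEncoder(nn.Module):
    """Maps either (x, y, z) coordinates or a goal image into one goal latent."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.from_coords = nn.Linear(3, latent_dim)
        self.from_image = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, latent_dim))

    def forward(self, coords=None, image=None):
        return self.from_coords(coords) if coords is not None else self.from_image(image)

class ConditionedPolicy(nn.Module):
    def __init__(self, state_dim=64, latent_dim=32, act_dim=4):
        super().__init__()
        self.goal_enc = GoalEncoder(latent_dim)
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))

    def forward(self, state, coords=None, image=None):
        g = self.goal_enc(coords=coords, image=image)
        return self.net(torch.cat([state, g], dim=-1))

policy = ConditionedPolicy()
state = torch.randn(1, 64)
a_from_coords = policy(state, coords=torch.tensor([[0.2, -0.1, 0.05]]))   # spatial goal
a_from_image = policy(state, image=torch.rand(1, 3, 32, 32))              # visual goal
print(a_from_coords.shape, a_from_image.shape)
```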
https://arxiv.org/abs/2409.12005
Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
https://arxiv.org/abs/2409.12002
Representing the 3D environment with instance-aware semantic and geometric information is crucial for interaction-aware robots in dynamic environments. Nonetheless, creating such a representation poses challenges due to sensor noise, instance segmentation and tracking errors, and the objects' dynamic motion. This paper introduces a novel particle-based instance-aware semantic occupancy map to tackle these challenges. Particles with an augmented instance state are used to estimate the Probability Hypothesis Density (PHD) of the objects and implicitly model the environment. Utilizing a State-augmented Sequential Monte Carlo PHD (S$^2$MC-PHD) filter, these particles are updated to jointly estimate occupancy status, semantic, and instance IDs, mitigating noise. Additionally, a memory module is adopted to enhance the map's responsiveness to previously observed objects. Experimental results on the Virtual KITTI 2 dataset demonstrate that the proposed approach surpasses state-of-the-art methods across multiple metrics under different noise conditions. Subsequent tests using real-world data further validate the effectiveness of the proposed approach.
https://arxiv.org/abs/2409.11975
Reactive collision avoidance is essential for agile robots navigating complex and dynamic environments, enabling real-time obstacle response. However, this task is inherently challenging because it requires a tight integration of perception, planning, and control, which traditional methods often handle separately, resulting in compounded errors and delays. This paper introduces a novel approach that unifies these tasks into a single reactive framework using solely onboard sensing and computing. Our method combines nonlinear model predictive control with adaptive control barrier functions, directly linking perception-driven constraints to real-time planning and control. Constraints are determined by using a neural network to refine noisy RGB-D data, enhancing depth accuracy, and selecting points with the minimum time-to-collision to prioritize the most immediate threats. To maintain a balance between safety and agility, a heuristic dynamically adjusts the optimization process, preventing overconstraints in real time. Extensive experiments with an agile quadrotor demonstrate effective collision avoidance across diverse indoor and outdoor environments, without requiring environment-specific tuning or explicit mapping.
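The point-selection and barrier ideas can be sketched as follows: estimate a time-to-collision per depth point, keep the most imminent one, and test a simple distance control barrier function against a candidate velocity command. The adaptive weighting, NMPC coupling, and depth-refinement network are not reproduced here; this is an illustrative simplification.

```python
# Sketch of minimum time-to-collision point selection and a simple distance
# control barrier function h = ||p|| - r_safe with h_dot + alpha * h >= 0.
import numpy as np

def time_to_collision(p_rel, v_rel):
    """p_rel: obstacle point relative to the robot; v_rel: its velocity relative to the robot."""
    dist = np.linalg.norm(p_rel)
    closing_speed = -(p_rel @ v_rel) / dist        # positive when the point approaches
    return dist / closing_speed if closing_speed > 1e-6 else np.inf

def cbf_ok(p_rel, robot_vel, r_safe=0.5, alpha=2.0):
    h = np.linalg.norm(p_rel) - r_safe
    # Static obstacle assumed: d/dt ||p_rel|| = -(p_rel . robot_vel) / ||p_rel||.
    h_dot = -(p_rel @ robot_vel) / np.linalg.norm(p_rel)
    return h_dot + alpha * h >= 0.0

points = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.8, -0.2, 0.1]])   # depth points
robot_vel = np.array([1.5, 0.0, 0.0])
ttcs = [time_to_collision(p, -robot_vel) for p in points]     # static points: v_rel = -robot_vel
most_imminent = points[int(np.argmin(ttcs))]
print("min TTC [s]:", round(min(ttcs), 2), "point:", most_imminent)
print("command admissible:", cbf_ok(most_imminent, robot_vel))
```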
https://arxiv.org/abs/2409.11962
Recent advances in machine learning have paved the way for the development of musical and entertainment robots. However, human-robot cooperative instrument playing remains a challenge, particularly due to the intricate motor coordination and temporal synchronization. In this paper, we propose a theoretical framework for human-robot cooperative piano playing based on non-verbal cues. First, we present a music improvisation model that employs a recurrent neural network (RNN) to predict appropriate chord progressions based on the human's melodic input. Second, we propose a behavior-adaptive controller to facilitate seamless temporal synchronization, allowing the cobot to generate harmonious acoustics. The collaboration takes into account the bidirectional information flow between the human and robot. We have developed an entropy-based system to assess the quality of cooperation by analyzing the impact of different communication modalities during human-robot collaboration. Experiments demonstrate that our RNN-based improvisation can achieve a 93\% accuracy rate. Meanwhile, with the MPC adaptive controller, the robot could respond to the human teammate in homophony performances with real-time accompaniment. Our designed framework has been validated to be effective in allowing humans and robots to work collaboratively in the artistic piano-playing task.
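A toy version of the melody-to-chord component is sketched below; vocabulary sizes, dimensions, and the training setup are illustrative, not the paper's model.

```python
# Toy sketch of the improvisation component: an RNN consumes a window of melody
# tokens and predicts the next chord class for the accompaniment.
import torch
import torch.nn as nn

class ChordRNN(nn.Module):
    def __init__(self, n_pitches=128, n_chords=24, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_pitches, 32)     # melody notes as MIDI pitch tokens
        self.rnn = nn.GRU(32, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_chords)       # e.g., 12 major + 12 minor triads

    def forward(self, melody):
        x = self.embed(melody)
        _, h = self.rnn(x)
        return self.out(h[-1])                       # logits over the next chord

model = ChordRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
melody = torch.randint(0, 128, (8, 16))              # batch of 16-note melody windows
target_chord = torch.randint(0, 24, (8,))
loss = nn.functional.cross_entropy(model(melody), target_chord)
opt.zero_grad(); loss.backward(); opt.step()
print(model(melody).argmax(dim=-1))                  # predicted chord ids for accompaniment
```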
https://arxiv.org/abs/2409.11952
When your robot grasps an object using dexterous hands or grippers, it should understand the Task-Oriented Affordances of the Object (TOAO), as different tasks often require attention to specific parts of the object. To address this challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented Affordance of Objects, which leverages vision-language models in a zero-shot manner to predict affordance-relevant regions of an object, given a natural language query. Our approach introduces a new paradigm: "static camera, moving object," allowing the robot to better observe and understand the object in hand during manipulation. GauTOAO addresses the limitations of existing methods, which often lack effective spatial grouping, by extracting a comprehensive 3D object mask using DINO features. This mask is then used to conditionally query gaussians, producing a refined semantic distribution over the object for the specified task. This approach results in more accurate TOAO extraction, enhancing the robot's understanding of the object and improving task performance. We validate the effectiveness of GauTOAO through real-world experiments, demonstrating its capability to generalize across various tasks.
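The mask-conditioned query step can be sketched by projecting gaussian centers into the image, keeping those inside the object mask, and normalizing a per-gaussian relevance score for the query; the projection, mask, and scoring below are stand-ins for the DINO features and vision-language model used in the paper.

```python
# Rough sketch of the mask-conditioned gaussian query: project 3D gaussian centers
# into the current image, zero out gaussians outside the object mask, and return a
# normalized per-gaussian relevance distribution for the language query.
import numpy as np

def project_points(points, K):
    uv = points @ K.T
    return uv[:, :2] / uv[:, 2:3]

def query_gaussians(centers, relevance, mask, K):
    """relevance: per-gaussian similarity to the task query; mask: HxW boolean object mask."""
    uv = np.round(project_points(centers, K)).astype(int)
    h, w = mask.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    on_object = np.zeros(len(centers), dtype=bool)
    on_object[inside] = mask[uv[inside, 1], uv[inside, 0]]
    scores = relevance * on_object                  # zero out gaussians off the object mask
    return scores / scores.sum() if scores.sum() > 0 else scores   # distribution over gaussians

K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
centers = np.random.rand(500, 3) * [0.4, 0.3, 0.2] + [0, 0, 0.6]   # gaussians in front of the camera
relevance = np.random.rand(500)                                     # stand-in for query similarity
mask = np.zeros((240, 320), dtype=bool); mask[60:180, 80:240] = True
dist = query_gaussians(centers, relevance, mask, K)
print("gaussians assigned to the queried part:", int((dist > 0).sum()))
```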
https://arxiv.org/abs/2409.11941