Autonomous locomotion of mobile ground robots in unstructured environments, for tasks such as waypoint navigation or flipper control, requires a sufficiently accurate prediction of the robot-terrain interaction. Heuristics like occupancy grids or traversability maps are widely used but limit the actions available to robots with active flippers, as joint positions are not taken into account. We present a novel iterative geometric method to predict the 3D pose of mobile ground robots with active flippers on uneven ground with high accuracy and online planning capabilities. This is achieved by utilizing the ability of signed distance fields to represent surfaces with sub-voxel accuracy. The effectiveness of the presented approach is demonstrated on two different tracked robots in simulation and on a real platform. Compared to a tracking system as ground truth, our method predicts the robot position and orientation with an average accuracy of 3.11 cm and 3.91°, outperforming a recent heightmap-based approach. The implementation is made available as an open-source ROS package.
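To make the geometric idea concrete, here is a minimal sketch of settling a rigid robot pose onto a signed distance field: contact points under the tracks are driven toward zero signed distance by adjusting height and pitch. The grid lookup, contact-point layout, and update rule are illustrative assumptions, not the paper's implementation (which also handles roll, flipper joints, and sub-voxel interpolation).

```python
import numpy as np

def sdf_lookup(sdf, origin, resolution, points):
    # Nearest-cell lookup of signed distances; a real implementation would
    # interpolate trilinearly to reach sub-voxel accuracy.
    idx = np.clip(((points - origin) / resolution).astype(int), 0, np.array(sdf.shape) - 1)
    return sdf[idx[:, 0], idx[:, 1], idx[:, 2]]

def settle_pose(sdf, origin, resolution, contact_points, z, pitch, iters=50, step=0.2):
    """Iteratively lower and pitch the robot until its track contact points rest on the surface.

    contact_points: (N, 3) body-frame points, assumed to lie on both sides of x = 0.
    """
    for _ in range(iters):
        c, s = np.cos(pitch), np.sin(pitch)
        R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])      # pitch about the y axis
        world_pts = contact_points @ R.T + np.array([0.0, 0.0, z])
        d = sdf_lookup(sdf, origin, resolution, world_pts)
        z -= step * d.min()                                             # drive the lowest point to distance 0
        front = d[contact_points[:, 0] > 0].min()                       # clearance under the front supports
        rear = d[contact_points[:, 0] <= 0].min()                       # clearance under the rear supports
        pitch += step * (front - rear) / np.ptp(contact_points[:, 0])   # rotate toward the terrain
    return z, pitch
```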
https://arxiv.org/abs/2405.02121
Monitoring large-scale environments is a crucial task for managing remote alpine regions, especially for hazardous events such as avalanches. One key piece of information for avalanche risk forecasting is imagery of released avalanches. As these happen in remote and potentially dangerous locations, this data is difficult to obtain. Fixed-wing vehicles, due to their long range and travel speeds, are a promising platform for gathering aerial imagery to map avalanche activity. However, operating such vehicles in mountainous terrain remains a challenge due to the complex topography, regulations, and uncertain environment. In this work, we present a system that is capable of safely navigating and mapping an avalanche using a fixed-wing aerial system, and we discuss the challenges arising when executing such a mission. We show in our field experiments that we can effectively navigate in steep terrain while maximizing map quality. We expect our work to enable more autonomous operations of fixed-wing vehicles in alpine environments and to maximize the quality of the data gathered.
https://arxiv.org/abs/2405.02011
Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited utilization of structured priors inherent in map segmentation masks. In light of this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using a latent diffusion model. By incorporating this technique, the performance of existing semantic segmentation methods can be significantly enhanced, and certain structural errors present in the segmentation outputs can be effectively rectified. Notably, the proposed module can be seamlessly integrated into any map segmentation model, thereby augmenting its capability to accurately delineate semantic information. Furthermore, through extensive visualization analysis, our model demonstrates superior proficiency in generating results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps.
https://arxiv.org/abs/2405.02008
This paper presents a novel self-supervised two-frame multi-camera metric depth estimation network, termed M${^2}$Depth, which is designed to predict reliable scale-aware surrounding depth in autonomous driving. Unlike previous works that use multi-view images from a single time-step or multiple time-step images from a single camera, M${^2}$Depth takes temporally adjacent two-frame images from multiple cameras as inputs and produces high-quality surrounding depth. We first construct cost volumes in the spatial and temporal domains individually and propose a spatial-temporal fusion module that integrates the spatial-temporal information to yield a strong volume representation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and strengthen the depth edges. Extensive experimental results on the nuScenes and DDAD benchmarks show M${^2}$Depth achieves state-of-the-art performance. More results can be found in this https URL.
https://arxiv.org/abs/2405.02004
The increasing demand for underwater vehicles highlights the necessity for robust localization solutions in inspection missions. In this work, we present a novel real-time sonar-based underwater global positioning algorithm for AUVs (Autonomous Underwater Vehicles) designed for environments with a sparse distribution of human-made assets. Our approach exploits two synergistic data interpretation frontends applied to the same stream of sonar data acquired by a multibeam Forward-Looking Sonar (FLS). These observations are fused within a Particle Filter (PF) either to give more weight to particles that lie in high-likelihood regions or to resolve symmetric ambiguities. Preliminary experiments carried out in a simulated environment resembling a real underwater plant provided promising results. This work represents a starting point towards future developments of the method and subsequent exhaustive evaluations, including in real-world scenarios.
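As an illustration of how two frontends can be fused in one particle filter, the sketch below multiplies two per-particle observation likelihoods into the weights and resamples when the effective sample size collapses. The function names and the systematic-resampling choice are assumptions for illustration; the paper's actual frontends and weighting scheme may differ.

```python
import numpy as np

def pf_update(particles, weights, likelihood_a, likelihood_b):
    """Fuse two observation frontends into one particle-weight update (illustrative).

    likelihood_a / likelihood_b: callables mapping an (N, 3) array of particle
    poses [x, y, yaw] to per-particle observation likelihoods.
    """
    weights = weights * likelihood_a(particles) * likelihood_b(particles)
    weights /= weights.sum() + 1e-12

    # Systematic resampling when the effective sample size collapses, which also
    # helps break symmetric ambiguities once one mode starts to dominate.
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_eff < 0.5 * len(weights):
        positions = (np.arange(len(weights)) + np.random.uniform()) / len(weights)
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), len(weights) - 1)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```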
https://arxiv.org/abs/2405.01971
The recent embrace of machine learning (ML) in the development of autonomous weapons systems (AWS) creates serious risks to geopolitical stability and the free exchange of ideas in AI research. This topic has received comparatively little attention of late relative to risks stemming from superintelligent artificial general intelligence (AGI), but requires fewer assumptions about the course of technological development and is thus a nearer-future issue. ML is already enabling the substitution of AWS for human soldiers in many battlefield roles, reducing the upfront human cost, and thus political cost, of waging offensive war. In the case of peer adversaries, this increases the likelihood of "low intensity" conflicts which risk escalation to broader warfare. In the case of non-peer adversaries, it reduces the domestic blowback to wars of aggression. This effect can occur regardless of other ethical issues around the use of military AI such as the risk of civilian casualties, and does not require any superhuman AI capabilities. Further, the military value of AWS raises the specter of an AI-powered arms race and the misguided imposition of national security restrictions on AI research. Our goal in this paper is to raise awareness among the public and ML researchers on the near-future risks posed by full or near-full autonomy in military technology, and we provide regulatory suggestions to mitigate these risks. We call upon AI policy experts and the defense AI community in particular to embrace transparency and caution in their development and deployment of AWS to avoid the negative effects on global stability and AI research that we highlight here.
https://arxiv.org/abs/2405.01859
Image segmentation is one of the major computer vision tasks, which is applicable in a variety of domains, such as autonomous navigation of an unmanned aerial vehicle. However, image segmentation cannot easily materialize on tiny embedded systems because image segmentation models generally have high peak memory usage due to their architectural characteristics. This work finds that image segmentation models require unnecessarily large memory space when run with an existing tiny machine learning framework; that is, the existing framework cannot effectively manage the memory space for image segmentation models. This work proposes TinySeg, a new model optimizing framework that enables memory-efficient image segmentation for tiny embedded systems. TinySeg analyzes the lifetimes of tensors in the target model and identifies long-living tensors. Then, TinySeg optimizes the memory usage of the target model mainly with two methods: (i) tensor spilling into local or remote storage and (ii) fused fetching of spilled tensors. This work implements TinySeg on top of the existing tiny machine learning framework and demonstrates that TinySeg can reduce the peak memory usage of an image segmentation model by 39.3% for tiny embedded systems.
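The core of the analysis can be pictured as a tensor-lifetime pass over the operator list: tensors that stay live across many operators (typically skip connections in segmentation models) become spilling candidates. The sketch below is only a schematic reconstruction of that idea, with made-up operator and tensor names, not the TinySeg implementation.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    inputs: list
    outputs: list

def tensor_lifetimes(ops):
    """Map each tensor to its (producer_index, last_consumer_index) interval."""
    first, last = {}, {}
    for i, op in enumerate(ops):
        for t in op.outputs:
            first.setdefault(t, i)
        for t in op.inputs:
            last[t] = i
    return {t: (first.get(t, 0), last.get(t, first.get(t, 0))) for t in set(first) | set(last)}

def spill_candidates(ops, min_gap=3):
    """Long-living tensors (alive across many operators) are candidates for spilling to storage."""
    return [t for t, (s, e) in tensor_lifetimes(ops).items() if e - s >= min_gap]

# Example: a skip connection keeps 'enc1' alive across the whole decoder.
ops = [Op("conv1", ["img"], ["enc1"]),
       Op("conv2", ["enc1"], ["enc2"]),
       Op("conv3", ["enc2"], ["enc3"]),
       Op("up1", ["enc3"], ["dec1"]),
       Op("concat", ["dec1", "enc1"], ["dec2"])]
print(spill_candidates(ops))   # ['enc1']
```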
https://arxiv.org/abs/2405.01857
Autonomous wheeled-legged robots have the potential to transform logistics systems, improving operational efficiency and adaptability in urban environments. Navigating urban environments, however, poses unique challenges for robots, necessitating innovative solutions for locomotion and navigation. These challenges include the need for adaptive locomotion across varied terrains and the ability to navigate efficiently around complex dynamic obstacles. This work introduces a fully integrated system comprising adaptive locomotion control, mobility-aware local navigation planning, and large-scale path planning within the city. Using model-free reinforcement learning (RL) techniques and privileged learning, we develop a versatile locomotion controller. This controller achieves efficient and robust locomotion over various rough terrains, facilitated by smooth transitions between walking and driving modes. It is tightly integrated with a learned navigation controller through a hierarchical RL framework, enabling effective navigation through challenging terrain and various obstacles at high speed. Our controllers are integrated into a large-scale urban navigation system and validated by autonomous, kilometer-scale navigation missions conducted in Zurich, Switzerland, and Seville, Spain. These missions demonstrate the system's robustness and adaptability, underscoring the importance of integrated control systems in achieving seamless navigation in complex environments. Our findings support the feasibility of wheeled-legged robots and hierarchical RL for autonomous navigation, with implications for last-mile delivery and beyond.
https://arxiv.org/abs/2405.01792
Unmanned Aerial Vehicles (UAVs) have emerged as a transformative technology across diverse sectors, offering adaptable solutions to complex challenges in both military and civilian domains. Their expanding capabilities present a platform for further advancement by integrating cutting-edge computational tools like Artificial Intelligence (AI) and Machine Learning (ML) algorithms. These advancements have significantly impacted various facets of human life, fostering an era of unparalleled efficiency and convenience. Large Language Models (LLMs), a key component of AI, exhibit remarkable learning and adaptation capabilities within deployed environments, demonstrating an evolving form of intelligence with the potential to approach human-level proficiency. This work explores the significant potential of integrating UAVs and LLMs to propel the development of autonomous systems. We comprehensively review LLM architectures, evaluating their suitability for UAV integration. Additionally, we summarize the state-of-the-art LLM-based UAV architectures and identify novel opportunities for LLM embedding within UAV frameworks. Notably, we focus on leveraging LLMs to refine data analysis and decision-making processes, specifically for enhanced spectral sensing and sharing in UAV applications. Furthermore, we investigate how LLM integration expands the scope of existing UAV applications, enabling autonomous data processing, improved decision-making, and faster response times in emergency scenarios like disaster response and network restoration. Finally, we highlight crucial areas for future research that are critical for facilitating the effective integration of LLMs and UAVs.
https://arxiv.org/abs/2405.01745
To perform effective causal inference in high-dimensional datasets, initiating the process with causal discovery is imperative, wherein a causal graph is generated based on observational data. However, obtaining a complete and accurate causal graph poses a formidable challenge, recognized as an NP-hard problem. Recently, the advent of Large Language Models (LLMs) has ushered in a new era, indicating their emergent capabilities and widespread applicability in facilitating causal reasoning across diverse domains, such as medicine, finance, and science. The expansive knowledge base of LLMs holds the potential to elevate the field of causal reasoning by offering interpretability, inference capabilities, generalizability, and the ability to uncover novel causal structures. In this paper, we introduce a new framework, named Autonomous LLM-Augmented Causal Discovery Framework (ALCM), to synergize data-driven causal discovery algorithms and LLMs, automating the generation of a more resilient, accurate, and explicable causal graph. The ALCM consists of three integral components: causal structure learning, causal wrapper, and LLM-driven causal refiner. These components autonomously collaborate within a dynamic environment to address causal discovery questions and deliver plausible causal graphs. We evaluate the ALCM framework by implementing two demonstrations on seven well-known datasets. Experimental results demonstrate that ALCM outperforms existing LLM methods and conventional data-driven causal reasoning mechanisms. This study not only shows the effectiveness of the ALCM but also underscores new research directions in leveraging the causal reasoning capabilities of LLMs.
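A rough sketch of how the causal wrapper and LLM-driven refiner could interact is shown below: candidate edges from a data-driven algorithm are serialized into a prompt, and the refiner's per-edge verdicts are applied back to the graph. The prompt format, verdict vocabulary, and example variables are hypothetical; the paper does not specify this exact interface.

```python
def causal_wrapper(variables, edges):
    """Wrap a candidate graph into a natural-language prompt for the refiner (illustrative)."""
    lines = [f"- {a} -> {b}" for a, b in edges]
    return ("Variables: " + ", ".join(variables) + "\n"
            "Candidate causal edges:\n" + "\n".join(lines) + "\n"
            "For each edge, answer KEEP, REVERSE, or DROP with a one-line justification.")

def refine_graph(edges, llm_verdicts):
    """Apply the refiner's per-edge verdicts to the data-driven graph."""
    refined = []
    for (a, b), verdict in zip(edges, llm_verdicts):
        if verdict == "KEEP":
            refined.append((a, b))
        elif verdict == "REVERSE":
            refined.append((b, a))
        # DROP: edge is discarded
    return refined

edges = [("smoking", "cancer"), ("yellow_fingers", "smoking")]
prompt = causal_wrapper(["smoking", "cancer", "yellow_fingers"], edges)
print(refine_graph(edges, ["KEEP", "REVERSE"]))   # [('smoking', 'cancer'), ('smoking', 'yellow_fingers')]
```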
https://arxiv.org/abs/2405.01744
Out-of-distribution (OOD) detection is essential in autonomous driving, to determine when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings, thus lacking effective human interaction capabilities. With the rise of large foundation models, multimodal inputs offer the possibility of taking human language as a latent representation, thus enabling language-defined OOD detection. In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation to improve the transparency and controllability of latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders that can only produce latent representations that are meaningless from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder and helps improve the detection performance when combined with standard representations.
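A minimal sketch of the language-defined representation is below: CLIP image and text embeddings are compared by cosine similarity, and the resulting per-prompt similarity vector can then feed a standard OOD detector fit on in-distribution driving data. The prompt list and model checkpoint are illustrative assumptions, not the paper's configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prompts chosen here are illustrative, not the paper's exact label set.
prompts = ["a photo of a clear road", "a photo of heavy snow on the road",
           "a photo of dense fog", "a photo of an overturned truck"]

def language_features(image: Image.Image) -> torch.Tensor:
    """Cosine similarities between the image and each text prompt, used as a latent vector."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)   # one similarity score per prompt

# An OOD detector (e.g. a one-class model or a distance to training features)
# can then be fit on these per-prompt similarity vectors from in-distribution data.
```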
https://arxiv.org/abs/2405.01691
The ability to determine the pose of a rover in an inertial frame autonomously is a crucial capability necessary for the next generation of surface rover missions on other planetary bodies. Currently, most on-going rover missions utilize ground-in-the-loop interventions to manually correct for drift in the pose estimate and this human supervision bottlenecks the distance over which rovers can operate autonomously and carry out scientific measurements. In this paper, we present ShadowNav, an autonomous approach for global localization on the Moon with an emphasis on driving in darkness and at nighttime. Our approach uses the leading edge of Lunar craters as landmarks and a particle filtering approach is used to associate detected craters with known ones on an offboard map. We discuss the key design decisions in developing the ShadowNav framework for use with a Lunar rover concept equipped with a stereo camera and an external illumination source. Finally, we demonstrate the efficacy of our proposed approach in both a Lunar simulation environment and on data collected during a field test at Cinder Lakes, Arizona.
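To illustrate the landmark-based update, the sketch below scores a single particle by transforming detected crater leading edges into the map frame, associating each with its nearest known crater, and converting the residuals into a likelihood. The 2D pose parameterization, nearest-neighbour association, and Gaussian noise model are simplifying assumptions rather than the ShadowNav implementation.

```python
import numpy as np

def crater_likelihood(particle_pose, detections, map_craters, sigma=5.0):
    """Likelihood of crater detections given one particle's pose (illustrative).

    detections: (N, 2) crater leading-edge positions in the rover frame.
    map_craters: (M, 2) known crater positions in the offboard map frame.
    """
    x, y, yaw = particle_pose
    R = np.array([[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]])
    world = detections @ R.T + np.array([x, y])
    # Nearest-neighbour association of each detection to a mapped crater.
    d = np.linalg.norm(world[:, None, :] - map_craters[None, :, :], axis=-1).min(axis=1)
    # Independent Gaussian residual model over the associated landmarks.
    return np.exp(-0.5 * (d / sigma) ** 2).prod()
```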
https://arxiv.org/abs/2405.01673
A unified and versatile LiDAR segmentation model with strong robustness and generalizability is desirable for safe autonomous driving perception. This work presents M3Net, a one-of-a-kind framework for fulfilling multi-task, multi-dataset, multi-modality LiDAR segmentation in a universal manner using just a single set of parameters. To better exploit data volume and diversity, we first combine large-scale driving datasets acquired by different types of sensors from diverse scenes and then conduct alignments in three spaces, namely data, feature, and label spaces, during the training. As a result, M3Net is capable of taming heterogeneous data for training state-of-the-art LiDAR segmentation models. Extensive experiments on twelve LiDAR segmentation datasets verify our effectiveness. Notably, using a shared set of parameters, M3Net achieves 75.1%, 83.1%, and 72.4% mIoU scores, respectively, on the official benchmarks of SemanticKITTI, nuScenes, and Waymo Open.
https://arxiv.org/abs/2405.01538
The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
https://arxiv.org/abs/2405.01533
Adaptive Cruise Control (ACC) can automatically adjust the speed of the ego vehicle to maintain a safe distance from the vehicle ahead. The primary purpose of this research is to use cutting-edge computing approaches to locate and track vehicles in real time under various conditions to achieve a safe ACC. The paper examines the extension of ACC employing depth cameras and radar sensors within Autonomous Vehicles (AVs) to respond in real time to changing weather conditions, using the Car Learning to Act (CARLA) simulation platform at noon. The ego vehicle controller's decision to accelerate or decelerate depends on the speed of the leading vehicle and the safe distance from that vehicle. Simulation results show that Proportional-Integral-Derivative (PID) control of autonomous vehicles using a depth camera and radar sensors reduces the speed of the leading vehicle and the ego vehicle when it rains. In addition, longer travel time was observed for both vehicles in rainy conditions than in dry conditions. Also, PID control prevents rear-end collisions with the leading vehicle.
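For reference, the longitudinal control described here reduces to a textbook PID loop on the gap error between the ego vehicle and the leading vehicle. The gains and set-points below are illustrative placeholders, not values from the paper or a CARLA-ready controller.

```python
class PID:
    """Textbook PID controller; gains here are illustrative, not tuned for any vehicle."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def acc_command(pid, gap, gap_setpoint):
    """Positive output -> accelerate, negative -> brake, to hold the desired gap."""
    return pid.step(gap - gap_setpoint)

pid = PID(kp=0.8, ki=0.05, kd=0.3, dt=0.05)
command = acc_command(pid, gap=18.0, gap_setpoint=25.0)   # gap too small -> negative (brake)
```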
https://arxiv.org/abs/2405.01504
Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning (i.e., DAgger and variants), where a human operator provides corrective interventions during policy rollouts. However, collecting a sufficient amount of interventions to cover the distribution of policy mistakes can be burdensome for human operators. We propose IntervenGen (I-Gen), a novel data generation system that can autonomously produce a large set of corrective interventions with rich coverage of the state space from a small number of human interventions. We apply I-Gen to 4 simulated environments and 1 physical environment with object pose estimation error and show that it can increase policy robustness by up to 39x with only 10 human interventions. Videos and more results are available at this https URL.
https://arxiv.org/abs/2405.01472
Understanding user enjoyment is crucial in human-robot interaction (HRI), as it can impact interaction quality and influence user acceptance and long-term engagement with robots, particularly in the context of conversations with social robots. However, current assessment methods rely solely on self-reported questionnaires, failing to capture interaction dynamics. This work introduces the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), a novel scale for assessing user enjoyment from an external perspective during conversations with a robot. Developed through rigorous evaluations and discussions of three annotators with relevant expertise, the scale provides a structured framework for assessing enjoyment in each conversation exchange (turn) alongside overall interaction levels. It aims to complement self-reported enjoyment from users and holds the potential for autonomously identifying user enjoyment in real-time HRI. The scale was validated on 25 older adults' open-domain dialogue with a companion robot that was powered by a large language model for conversations, corresponding to 174 minutes of data, showing moderate to good alignment. Additionally, the study offers insights into understanding the nuances and challenges of assessing user enjoyment in robot interactions, and provides guidelines on applying the scale to other domains.
https://arxiv.org/abs/2405.01354
Simulation is a fundamental tool in developing autonomous vehicles, enabling rigorous testing without the logistical and safety challenges associated with real-world trials. As autonomous vehicle technologies evolve and public safety demands increase, advanced, realistic simulation frameworks are critical. Current testing paradigms employ a mix of general-purpose and specialized simulators, such as CARLA and IVRESS, to achieve high-fidelity results. However, these tools often struggle with compatibility due to differing platform, hardware, and software requirements, severely hampering their combined effectiveness. This paper introduces BlueICE, an advanced framework for ultra-realistic simulation and digital twinning, to address these challenges. BlueICE's innovative architecture allows for the decoupling of computing platforms, hardware, and software dependencies while offering researchers customizable testing environments to meet diverse fidelity needs. Key features include containerization to ensure compatibility across different systems, a unified communication bridge for seamless integration of various simulation tools, and synchronized orchestration of input and output across simulators. This framework facilitates the development of sophisticated digital twins for autonomous vehicle testing and sets a new standard in simulation accuracy and flexibility. The paper further explores the application of BlueICE in two distinct case studies: the ICAT indoor testbed and the STAR campus outdoor testbed at the University of Delaware. These case studies demonstrate BlueICE's capability to create sophisticated digital twins for autonomous vehicle testing and underline its potential as a standardized testbed for future autonomous driving technologies.
https://arxiv.org/abs/2405.01328
This paper introduces a trajectory prediction model tailored for autonomous driving, focusing on capturing complex interactions in dynamic traffic scenarios without reliance on high-definition maps. The model, termed MFTraj, harnesses historical trajectory data combined with a novel dynamic geometric graph-based behavior-aware module. At its core, an adaptive structure-aware interactive graph convolutional network captures both positional and behavioral features of road users, preserving spatial-temporal intricacies. Enhanced by a linear attention mechanism, the model achieves computational efficiency and reduced parameter overhead. Evaluations on the Argoverse, NGSIM, HighD, and MoCAD datasets underscore MFTraj's robustness and adaptability, outperforming numerous benchmarks even in data-challenged scenarios without the need for additional information such as HD maps or vectorized maps. Importantly, it maintains competitive performance even in scenarios with substantial missing data, on par with most existing state-of-the-art models. The results and methodology suggest a significant advancement in autonomous driving trajectory prediction, paving the way for safer and more efficient autonomous systems.
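The efficiency claim rests on replacing softmax attention with a linear (kernelized) form that scales linearly in sequence length. The sketch below shows the generic phi(Q)(phi(K)^T V) formulation with phi = elu + 1; it illustrates the mechanism only and is not necessarily the exact variant used in MFTraj.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N d^2) attention via the kernel trick phi(Q) (phi(K)^T V), with phi = elu + 1.

    q, k, v: (batch, seq_len, dim). Generic linear-attention form for illustration.
    """
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)               # sum over the sequence once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)      # normalized per query

q = k = v = torch.randn(2, 16, 32)
out = linear_attention(q, k, v)                           # shape (2, 16, 32)
```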
https://arxiv.org/abs/2405.01266
The ability to autonomously assemble structures is crucial for the development of future space infrastructure. However, the unpredictable conditions of space pose significant challenges for robotic systems, necessitating the development of advanced learning techniques to enable autonomous assembly. In this study, we present a novel approach for learning autonomous peg-in-hole assembly in the context of space robotics. Our focus is on enhancing the generalization and adaptability of autonomous systems through deep reinforcement learning. By integrating procedural generation and domain randomization, we train agents in a highly parallelized simulation environment across a spectrum of diverse scenarios with the aim of acquiring a robust policy. The proposed approach is evaluated using three distinct reinforcement learning algorithms to investigate the trade-offs among various paradigms. We demonstrate the adaptability of our agents to novel scenarios and assembly sequences while emphasizing the potential of leveraging advanced simulation techniques for robot learning in space. Our findings set the stage for future advancements in intelligent robotic systems capable of supporting ambitious space missions and infrastructure development beyond Earth.
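Procedural generation and domain randomization amount to resampling the task parameters at every episode so that the learned policy cannot overfit to a single configuration. The parameter names and ranges below are invented for illustration; the paper's simulation exposes its own set.

```python
import random

def randomize_episode():
    """Sample peg-in-hole episode parameters; names and ranges are illustrative assumptions."""
    return {
        "hole_diameter_mm": random.uniform(20.0, 40.0),
        "clearance_mm": random.uniform(0.5, 3.0),
        "peg_mass_kg": random.uniform(0.2, 2.0),
        "friction": random.uniform(0.3, 1.2),
        "initial_offset_mm": [random.uniform(-15.0, 15.0) for _ in range(2)],
        "observation_noise_std": random.uniform(0.0, 0.005),
    }

for episode in range(3):
    params = randomize_episode()
    # env.reset(**params)  # a procedurally generated environment would consume these
    print(episode, params)
```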
https://arxiv.org/abs/2405.01134