Q-learning methods are widely used in robot path planning but often face challenges of inefficient search and slow convergence. We propose an Improved Q-learning (IQL) framework that enhances standard Q-learning in two significant ways. First, we introduce the Path Adaptive Collaborative Optimization (PACO) algorithm to optimize Q-table initialization, providing better initial estimates and accelerating learning. Second, we incorporate a Utility-Controlled Heuristic (UCH) mechanism with dynamically tuned parameters to optimize the reward function, enhancing the algorithm's accuracy and effectiveness in path-planning tasks. Extensive experiments in three different raster grid environments validate the superior performance of our IQL framework. The results demonstrate that our IQL algorithm outperforms existing methods, including FIQL, PP-QL-based CPP, DFQL, and QMABC algorithms, in terms of path-planning capabilities.
Q-learning方法在机器人路径规划中被广泛使用,但常常面临搜索效率低和收敛速度慢的挑战。我们提出了一种改进型Q学习(IQL)框架,在两个重要方面提升了标准Q学习的方法。首先,我们引入了路径自适应协同优化(PACO)算法来优化Q表的初始化,提供更好的初始估计值并加速学习过程。其次,我们整合了一个动态调整参数的效用控制启发式(UCH)机制以优化奖励函数,从而提高算法在路径规划任务中的准确性和有效性。 我们在三种不同的栅格环境中进行了广泛的实验,验证了我们的IQL框架的优越性能。结果表明,在路径规划能力方面,我们的IQL算法优于现有方法,包括FIQL、基于PP-QL的CPP、DFQL和QMABC算法。
https://arxiv.org/abs/2501.05411
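The idea of giving Q-learning a better starting point can be illustrated with a minimal tabular sketch. It is not the PACO or UCH method from the abstract: the Manhattan-distance initialization, the reward values, and all function names (`heuristic_q_table`, `step`, `train`, `greedy_path`) are assumptions chosen only to show where a Q-table initializer and a shaped reward plug into standard Q-learning on a grid.

```python
import random

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # up, down, right, left


def heuristic_q_table(width, height, goal):
    """Initialise every Q-value to the negative Manhattan distance to the goal.

    Hypothetical stand-in for an optimised initialisation such as PACO.
    """
    q = {}
    for x in range(width):
        for y in range(height):
            dist = abs(x - goal[0]) + abs(y - goal[1])
            q[(x, y)] = [-float(dist)] * len(ACTIONS)
    return q


def step(state, action, width, height, obstacles, goal):
    """Apply one move; reward values are illustrative assumptions."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
    if not inside or nxt in obstacles:
        return state, -5.0, False  # blocked move: stay in place, pay a penalty
    if nxt == goal:
        return nxt, 100.0, True
    return nxt, -1.0, False  # small step cost favours short paths


def train(width, height, obstacles, start, goal,
          episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Standard epsilon-greedy tabular Q-learning from the heuristic table."""
    rng = random.Random(seed)
    q = heuristic_q_table(width, height, goal)
    for _ in range(episodes):
        state = start
        for _ in range(4 * width * height):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
            nxt, reward, done = step(state, action, width, height, obstacles, goal)
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, start, goal, width, height, obstacles, max_steps=100):
    """Follow the greedy policy from start and return the visited cells."""
    path, state = [start], start
    for _ in range(max_steps):
        action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
        state, _, done = step(state, action, width, height, obstacles, goal)
        path.append(state)
        if done:
            break
    return path
```

Calling `train(5, 5, {(1, 1), (2, 2), (3, 3)}, (0, 0), (4, 4))` and then `greedy_path` on the result gives the learned route. Swapping `heuristic_q_table` for an all-zero table is a simple way to compare how initialization affects the number of episodes needed.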
Every maneuver of a vehicle redistributes risks between road users. While human drivers do this intuitively, autonomous vehicles allow and require deliberative algorithmic risk management. But how should traffic risks be distributed among road users? In a global experimental study in eight countries with different cultural backgrounds and almost 11,000 participants, we compared risk distribution preferences. It turns out that risk preferences in road traffic are strikingly similar between the cultural zones. The vast majority of participants in all countries deviates from a guiding principle of minimizing accident probabilities in favor of weighing up the probability and severity of accidents. At the national level, the consideration of accident probability and severity hardly differs between countries. The social dilemma of autonomous vehicles detected in deterministic crash scenarios disappears in risk assessments of everyday traffic situations in all countries. In no country do cyclists receive a risk bonus that goes beyond their higher vulnerability. In sum, our results suggest that a global consensus on the risk ethics of autonomous driving is easier to establish than on the ethics of crashing.
每一次车辆的操作都会重新分配道路上各使用者的风险。尽管人类驾驶员会凭直觉进行风险再分配,自动驾驶汽车则允许并需要通过算法来进行有意识的风险管理。但交通风险应该如何在道路使用者之间分布呢?在一个涵盖八个国家、具有不同文化背景的全球实验研究中,我们对近1.1万名参与者进行了风险分配偏好的对比分析。 结果显示,在道路交通中的风险偏好在各个文化区域间非常相似。所有国家的大多数参与者都偏离了最小化事故概率的原则,而是倾向于权衡事故的概率和严重性。从国家层面来看,各国在考虑事故发生概率和严重性的方法上几乎没有差异。在确定性碰撞场景中发现的自动驾驶车辆的社会困境,在所有国家的日常交通情况风险评估中都不复存在。 没有一个国家会给予骑自行车者超出其更高脆弱性以外的风险补偿。总的来说,我们的研究结果表明,建立全球统一的自动驾驶风险伦理共识比制定碰撞事件中的道德规范要容易得多。
https://arxiv.org/abs/2501.05391
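The contrast between minimizing accident probability and weighing probability against severity can be made concrete with a small sketch. The maneuver names, probabilities, and severity scores below are invented for illustration and are not data from the study.

```python
# Each maneuver maps road users to (accident probability, severity score).
# All numbers are hypothetical.
MANEUVERS = {
    "keep_left": {"car": (0.02, 1.0), "cyclist": (0.01, 5.0)},
    "keep_right": {"car": (0.04, 1.0), "cyclist": (0.002, 5.0)},
}


def total_accident_probability(risks):
    """Sum of accident probabilities, ignoring how severe each accident is."""
    return sum(p for p, _ in risks.values())


def expected_harm(risks):
    """Probability-weighted severity, summed over all road users."""
    return sum(p * severity for p, severity in risks.values())


def choose(maneuvers, criterion):
    """Return the maneuver name that minimises the given criterion."""
    return min(maneuvers, key=lambda name: criterion(maneuvers[name]))
```

With these numbers, `choose(MANEUVERS, total_accident_probability)` and `choose(MANEUVERS, expected_harm)` pick different maneuvers, because the second criterion gives more weight to the more vulnerable road user.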
Corrigibility of autonomous agents is an underexplored part of system design, with previous work focusing on single-agent systems. It has been suggested that uncertainty over human preferences keeps agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi-agent setting as a two-player game in which the agents always have a move with which they can ask the human for supervision. This is formulated as a Bayesian game in order to introduce uncertainty over the human's beliefs. We further analyse two specific cases. First, a two-player corrigibility game, in which we want both agents to display corrigibility, for both common-payoff (monotone) games and harmonic games. Then we investigate an adversarial setting, in which one agent is considered a 'defending' agent and the other an 'adversary'. A general result is provided for what belief over the games and human rationality the defending agent is required to have to induce corrigibility.
自主代理的可纠正性是系统设计中一个尚未充分探索的部分,之前的研究主要集中在单个代理系统上。有人提出,在面对人类非理性的情况下,对人类偏好的不确定性可以保持代理的可纠正性。我们在此提出了一个多代理环境下的可纠正性的通用框架,并将其建模为一个两人博弈游戏,在该游戏中,代理始终有一个动作可以让它们请求人类进行监督。为了引入关于人类信念的不确定性,我们将这一问题形式化为贝叶斯博弈。 进一步地,我们分析了两个特定案例: 1. **双人可纠正性博弈**:在这种情境下,我们希望在双方都表现出共同收益(单调)游戏和和谐游戏中均具备可纠正性的代理。这意味着,无论是在合作还是竞争环境中,代理都能根据人类的指导调整其行为。 2. **对抗设置**:在此场景中,一个代理被视作“防御方”,另一个作为“对手”。我们提供了一个关于哪些信念需要由防守方代理持有以诱导出可纠正性的一般结果。具体来说,这个一般结论涉及游戏类型和人类理性之间的关系,确定了为了保证代理能够响应并服从于人类的监督而必须持有的信念。 通过这些分析,我们可以更好地理解如何在多代理系统中设计具有自我修正能力的人工智能,以确保它们即使面临复杂的社会和技术环境也能正确地服务于人类的最佳利益。
https://arxiv.org/abs/2501.05360
Semantic segmentation for autonomous driving is an even more challenging task when faced with adverse driving conditions. Standard models trained on data recorded under ideal conditions show deteriorated performance in unfavorable weather or illumination conditions. Fine-tuning on the new task or condition would overwrite the previously learned information, resulting in catastrophic forgetting. Adapting to the new conditions through traditional domain adaptation methods improves the performance on the target domain at the expense of the source domain. Addressing these issues, we propose an architecture-based domain-incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing collection of domain-specific segmentation models. The task of inferring the domain and subsequently selecting the appropriate module for segmentation is carried out using a collection of convolutional autoencoders. We extensively evaluate our proposed approach using several datasets at varying levels of granularity in the categorization of adverse driving conditions. Furthermore, we demonstrate the generalization of the proposed approach to similar and unseen domains.
针对自动驾驶中的语义分割任务,在遇到不利的驾驶条件时变得更加具有挑战性。标准模型在理想条件下训练后,在恶劣天气或光照条件下性能会下降。对新任务或条件进行微调会导致覆盖之前学习的信息,从而引发灾难性遗忘。通过传统领域适应方法来适应新的条件虽然可以提高目标领域的表现,但也会牺牲源域的性能。 为解决这些问题,我们提出了一种基于架构的增量领域学习方法,称为渐进语义分割(Progressive Semantic Segmentation, PSS)。PSS是一个任务无关、动态增长的领域特定分割模型集合。通过一组卷积自编码器来完成推断领域并选择适当的模块进行分割的任务。 我们在多个数据集上对所提出的这种方法进行了广泛的评估,这些数据集中包含了不同程度的不利驾驶条件分类细节。此外,我们还展示了该方法在类似和未见过领域的泛化能力。
https://arxiv.org/abs/2501.05246
Autonomous vehicles rely on camera-based perception systems to comprehend their driving environment and make crucial decisions, thereby ensuring that the vehicle steers safely. However, a significant threat known as Electromagnetic Signal Injection Attacks (ESIA) can distort the images captured by these cameras, leading to incorrect AI decisions and potentially compromising the safety of autonomous vehicles. Despite the serious implications of ESIA, there is limited understanding of its impact on the robustness of AI models across varied and complex driving scenarios. To address this gap, our research analyzes the performance of different models under ESIA, revealing their vulnerabilities to the attacks. Moreover, because real-world attack data is difficult to obtain, we develop a novel ESIA simulation method and generate a simulated attack dataset for different driving scenarios. Our research provides a comprehensive simulation and evaluation framework, aiming to support the development of more robust AI models and secure intelligent systems, ultimately contributing to safer and more reliable technology across various fields.
自主驾驶汽车依赖于基于摄像头的感知系统来理解其行驶环境并做出关键决策,从而确保车辆的安全操控。然而,一种被称为电磁信号注入攻击(ESIA)的重大威胁可以扭曲这些摄像机捕捉到的画面,导致人工智能作出错误判断,并可能危及自动驾驶汽车的安全性。尽管ESIA的影响十分严重,但对于它对各种复杂驾驶场景中AI模型的鲁棒性的冲击却了解有限。为解决这一空白,我们的研究分析了不同模型在面对ESIA时的表现,揭示了它们在此类攻击面前的脆弱性。此外,由于难以获取实际世界中的攻击数据,我们开发了一种新颖的ESIA模拟方法,并生成了适用于各种驾驶场景的虚拟攻击数据集。 本研究提供了一个全面的仿真与评估框架,旨在增强更强大AI模型和安全智能系统的研发工作,最终为各个领域的技术发展做出贡献,提升其安全性及可靠性。
https://arxiv.org/abs/2501.05239
Real-time object detection plays an essential part in the decision-making process of numerous real-world applications, including collision avoidance and path planning in autonomous driving systems. This paper presents a novel real-time streaming perception method named CorrDiff, designed to tackle the challenge of delays in real-time detection systems. The main contribution of CorrDiff lies in its adaptive delay-aware detector, which uses runtime-estimated temporal cues to predict objects' locations for multiple future frames and selectively produces predictions that match real-world time, effectively compensating for communication and computational delays. By leveraging motion estimation and feature enhancement, the proposed model outperforms current state-of-the-art methods both for 1) single-frame detection for the current frame or the next frame, in terms of mAP, and 2) prediction for one or more future frames, in terms of sAP (a metric that evaluates object detection algorithms in streaming scenarios, factoring in both latency and accuracy). It demonstrates robust performance across a range of devices, from the powerful Tesla V100 to the modest RTX 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods, which struggle to complete computation within a single frame on less powerful devices, CorrDiff meets the stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving. Our code is completely open-sourced and is available at this https URL.
实时目标检测在许多现实世界应用的决策过程中扮演着重要角色,包括自主驾驶系统中的碰撞避免和路径规划。本文提出了一种名为CorrDiff的新颖实时流式感知方法,旨在解决实时检测系统中延迟问题的挑战。CorrDiff的主要贡献在于其自适应时延感知探测器,能够利用运行时估计的时间线索来预测多个未来帧中的对象位置,并选择性地生成与现实世界时间匹配的预测,有效补偿了通信和计算延迟。 所提出的模型通过运动估计和特征增强超越了当前最先进方法,在以下方面表现出色:1)在mAP(平均精度均值)指标上,对于当前帧或下一帧进行单帧检测;2)在sAP(流场景下评估目标检测算法的度量标准,同时考虑延迟和准确率)指标上,对一个或多个未来帧的预测。该模型展示了从强大的Tesla V100到较为普通的RTX 2080Ti设备上的稳健性能,在所有平台上均达到了最高的感知精度水平。 与大多数最先进的方法在较弱设备上难以在一帧内完成计算不同,CorrDiff满足了各种类型设备严格的实时处理要求。实验结果强调了该系统的适应性和其对许多现实世界系统(如自主驾驶)的安全性和可靠性显著提升的潜力。我们的代码完全开源,并可在提供的链接中获取。
https://arxiv.org/abs/2501.05132
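The delay-compensation step described above, producing predictions for several future frames and emitting the one that matches real-world time, can be sketched as follows. This is an assumption-laden illustration, not CorrDiff itself: a constant-velocity extrapolation stands in for the learned detector, and the names `extrapolate` and `select_prediction` are hypothetical.

```python
import math


def extrapolate(center, velocity_per_frame, horizon):
    """Predict an object's centre for frames t, t+1, ..., t+horizon.

    Constant-velocity motion is a stand-in for a learned future-frame detector.
    """
    x, y = center
    vx, vy = velocity_per_frame
    return [(x + k * vx, y + k * vy) for k in range(horizon + 1)]


def select_prediction(future_predictions, estimated_delay_s, frame_interval_s):
    """Pick the prediction for the frame that will be current once the
    estimated processing and communication delay has elapsed."""
    steps_ahead = math.ceil(estimated_delay_s / frame_interval_s)
    # Clamp to the range of frames that were actually predicted.
    steps_ahead = min(max(steps_ahead, 0), len(future_predictions) - 1)
    return steps_ahead, future_predictions[steps_ahead]
```

For a 30 FPS stream, `select_prediction(extrapolate((100, 50), (4, -2), 3), 0.05, 1 / 30)` selects the prediction two frames ahead, since a 50 ms delay spans more than one frame interval.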
Recent advancements in reinforcement learning (RL) demonstrate the significant potential in autonomous driving. Despite this promise, challenges such as the manual design of reward functions and low sample efficiency in complex environments continue to impede the development of safe and effective driving policies. To tackle these issues, we introduce LearningFlow, an innovative automated policy learning workflow tailored to urban driving. This framework leverages the collaboration of multiple large language model (LLM) agents throughout the RL training process. LearningFlow includes a curriculum sequence generation process and a reward generation process, which work in tandem to guide the RL policy by generating tailored training curricula and reward functions. Particularly, each process is supported by an analysis agent that evaluates training progress and provides critical insights to the generation agent. Through the collaborative efforts of these LLM agents, LearningFlow automates policy learning across a series of complex driving tasks, and it significantly reduces the reliance on manual reward function design while enhancing sample efficiency. Comprehensive experiments are conducted in the high-fidelity CARLA simulator, along with comparisons with other existing methods, to demonstrate the efficacy of our proposed approach. The results demonstrate that LearningFlow excels in generating rewards and curricula. It also achieves superior performance and robust generalization across various driving tasks, as well as commendable adaptation to different RL algorithms.
最近在强化学习(RL)领域的进展展示了其在自动驾驶中的巨大潜力。尽管前景广阔,但手动设计奖励函数和复杂环境下的低样本效率等问题仍然阻碍了安全有效的驾驶策略的发展。为解决这些问题,我们提出了LearningFlow,这是一种针对城市驾驶的创新自动化政策学习工作流。该框架利用多个大型语言模型(LLM)代理在整个RL训练过程中协作。 LearningFlow包括课程序列生成过程和奖励生成过程,这两个过程协同合作以通过定制培训课程和奖励函数来指导RL策略。特别地,每个过程都有一个分析代理评估培训进度并为生成代理提供关键见解。这些LLM代理的共同努力使LearningFlow能够在一系列复杂的驾驶任务中自动化政策学习,并且大大减少了对手动设计奖励功能的依赖,同时提高了样本效率。 在高保真的CARLA模拟器中进行了全面实验,并与其他现有方法进行了比较,以展示我们提出的方法的有效性。结果表明,LearningFlow在生成奖励和课程方面表现出色。它还在各种驾驶任务上实现了卓越的表现和强大的泛化能力,并且能够适应不同的RL算法。
https://arxiv.org/abs/2501.05057
Autonomous vessels potentially enhance safety and reliability of seaborne trade. To facilitate the development of autonomous vessels, high-fidelity simulations are required to model realistic interactions with other vessels. However, modeling realistic interactive maritime traffic is challenging due to the unstructured environment, coarsely specified traffic rules, and largely varying vessel types. Currently, there is no standard for simulating interactive maritime environments in order to rigorously benchmark autonomous vessel algorithms. In this paper, we introduce the first intelligent sailing model (ISM), which simulates rule-compliant vessels for navigation on the open sea. An ISM vessel reacts to other traffic participants according to maritime traffic rules while at the same time solving a motion planning task characterized by waypoints. In particular, the ISM monitors the applicable rules, generates rule-compliant waypoints accordingly, and utilizes a model predictive control for tracking the waypoints. We evaluate the ISM in two environments: interactive traffic with only ISM vessels and mixed traffic where some vessel trajectories are from recorded real-world maritime traffic data or handcrafted for criticality. Our results show that simulations with many ISM vessels of different vessel types are rule-compliant and scalable. We tested 4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no collisions occurred while goal-reaching rates of about 97 percent were achieved. We believe that our ISM can serve as a standard for challenging and realistic maritime traffic simulation to accelerate autonomous vessel development.
自主船只有可能增强海上贸易的安全性和可靠性。为了促进自主船舶的发展,需要高保真的模拟来建模与其他船舶的真实互动情况。然而,由于海洋环境的无结构化、交通规则规定模糊以及船型差异大等因素,模拟现实中的相互作用性海事交通颇具挑战性。目前尚没有用于严格评估自主船舶算法的标准海事交互式环境仿真方法。 在本文中,我们引入了首个智能航行模型(ISM),该模型可模拟符合海洋交通规则的船只,在开放海域进行导航。一个ISM船只会根据海上交通规则对其他交通参与者作出反应,并同时解决由航路点特征化的运动规划任务。特别地,ISM会监测适用的规定,生成相应的合规航路点,并利用预测性控制模型来跟踪这些航路点。 我们在两个环境中评估了ISM的表现:仅包含ISM船只的互动交通环境以及混合交通环境(其中一些船舶轨迹来自记录的真实世界海事交通数据或为关键情况手工创建)。我们的结果显示,具有多种船型的大量ISM船只模拟结果符合规则并且具备扩展性。我们测试了4,049个关键交通场景。对于仅包含ISM船只的互动交通而言,没有发生碰撞事故,同时目标达成率约为97%。 我们认为,我们的ISM可以作为挑战性和现实性的海事交通仿真标准,以加速自主船舶的发展。
https://arxiv.org/abs/2501.04988
In autonomous driving, traditional Computer Vision (CV) agents often struggle in unfamiliar situations due to biases in the training data. Deep Reinforcement Learning (DRL) agents address this by learning from experience and maximizing rewards, which helps them adapt to dynamic environments. However, ensuring their generalization remains challenging, especially with static training environments. Additionally, DRL models lack transparency, making it difficult to guarantee safety in all scenarios, particularly those not seen during training. To tackle these issues, we propose a method that combines DRL with Curriculum Learning for autonomous driving. Our approach uses a Proximal Policy Optimization (PPO) agent and a Variational Autoencoder (VAE) to learn safe driving in the CARLA simulator. The agent is trained using two-fold curriculum learning, progressively increasing environment difficulty and incorporating a collision penalty in the reward function to promote safety. This method improves the agent's adaptability and reliability in complex environments, and helps to understand the nuances of balancing multiple reward components from different feedback signals in a single scalar reward function. Keywords: Computer Vision, Deep Reinforcement Learning, Variational Autoencoder, Proximal Policy Optimization, Curriculum Learning, Autonomous Driving.
在自动驾驶领域,传统的计算机视觉(CV)代理由于训练数据中的偏差,在处理未知情况时常常遇到困难。深度强化学习(DRL)代理通过从经验中学习并最大化奖励来适应动态环境,从而解决了这一问题。然而,确保其泛化能力仍是一个挑战,尤其是在静态的培训环境中。此外,DRL模型缺乏透明度,这使得在所有场景下保证安全变得困难,尤其是那些训练过程中未见过的情况。 为了解决这些问题,我们提出了一种结合深度强化学习和课程学习的方法来解决自动驾驶问题。我们的方法使用Proximal Policy Optimization(近端策略优化,PPO)代理以及Variational Autoencoder(变分自编码器,VAE),在CARLA仿真环境中学习安全驾驶。通过双重课程训练法逐步提高环境难度,并将碰撞惩罚纳入奖励函数中以促进安全性。这种方法提高了代理在复杂环境中的适应性和可靠性,同时帮助理解如何在一个标量奖励函数中平衡多个来自不同反馈信号的奖励成分。 关键词:计算机视觉、深度强化学习、变分自编码器、近端策略优化、课程学习、自动驾驶。
https://arxiv.org/abs/2501.04982
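The two curriculum ingredients named in the abstract, a scalar reward that folds in a collision penalty and a schedule that raises environment difficulty, can be sketched in a few lines. The component names, weights, penalty size, and promotion threshold are all assumptions for illustration; the abstract does not specify them.

```python
def scalar_reward(speed_term, lane_term, collided,
                  weights=(1.0, 0.5), collision_penalty=10.0):
    """Combine several feedback signals into one scalar reward.

    speed_term and lane_term are hypothetical components in [0, 1];
    a collision subtracts a fixed penalty.
    """
    reward = weights[0] * speed_term + weights[1] * lane_term
    if collided:
        reward -= collision_penalty
    return reward


def next_stage(stage, recent_success_rate, threshold=0.8, max_stage=3):
    """Move to a harder environment once the agent succeeds often enough."""
    if recent_success_rate >= threshold and stage < max_stage:
        return stage + 1
    return stage
```

The relative size of `collision_penalty` against the weighted positive terms is the balancing question the abstract refers to: too small and the agent trades safety for progress, too large and it may learn not to move.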
As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities by improving the way these systems understand complex real-world environments and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird's Eye View (BEV) embeddings to represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA leads to simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate the high quality of the embeddings learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and outperforms SOTA methods, including the most recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at this https URL.
与人类驾驶员不同,当前的自动驾驶系统仍然需要大量的标注数据来进行训练。最近,世界模型被提出以同时增强这些系统的理解能力,使其更好地处理复杂的现实环境,并通过自我监督的预训练来减少其对数据的需求。在本文中,我们提出了AD-L-JEPA(即基于LiDAR数据并通过联合嵌入预测架构进行自动驾驶),这是一种新颖的针对自动驾驶中的LiDAR数据的自监督预训练框架,与现有方法不同的是,它既不是生成式的也不是对比式的。我们的方法通过联合嵌入预测架构学习空间世界模型。不同于明确地生成遮蔽未知区域的方式,我们的自监督世界模型会预测俯视图(BEV)嵌入以表示自动驾驶场景的多样性。此外,我们所提出的方法还消除了创建正负样本对的需求,这是对比学习中需要手动完成的任务。因此,AD-L-JEPA简化了实现过程,并提升了学到的表示能力。我们在定性和定量上展示了通过AD-L-JEPA学得的嵌入具有高质量的特点。 为了评估AD-L-JEPA在下游任务中的准确性以及标注效率,我们对包括LiDAR 3D物体检测和相关迁移学习在内的流行任务进行了测试。实验结果表明,AD-L-JEPA是自监督预训练应用于自动驾驶领域的一种可行方法,并且优于现有的最佳方法(SOTA),包括最近提出的Occupancy-MAE [1]和ALSO [2]。 AD-L-JEPA的源代码可以在此网址获取:[此URL]。
https://arxiv.org/abs/2501.04969
Deep neural network (DNN) based perception models are indispensable in the development of autonomous vehicles (AVs). However, their reliance on large-scale, high-quality data is broadly recognized as a burdensome necessity due to the substantial cost of data acquisition and labeling. Further, the issue is not a one-time concern, as AVs might need a new dataset if they are to be deployed to another region (real-target domain) that the in-hand dataset within the real-source domain cannot cover. To mitigate this burden, we propose leveraging synthetic environments as an auxiliary domain where the characteristics of real domains are reproduced. This approach could enable indirect experience of the real-target domain in a time- and cost-effective manner. As a practical demonstration of our methodology, nuScenes and South Korea are employed to represent the real-source and real-target domains, respectively. That is, we construct digital twins for several regions of South Korea and reproduce the data-acquisition framework of nuScenes. Blending these components within a simulator allows us to obtain a synthetic-fusion domain in which we forge our novel driving dataset, MORDA: Mixture Of Real-domain characteristics for synthetic-data-assisted Domain Adaptation. To verify the value of the synthetic features that MORDA provides in learning about the driving environments of South Korea, 2D/3D detectors are trained solely on a combination of nuScenes and MORDA. Afterward, their performance is evaluated on an unseen real-world dataset (AI-Hub) collected in South Korea. Our experiments show that MORDA can significantly improve mean Average Precision (mAP) on the AI-Hub dataset, while mAP on nuScenes is retained or slightly enhanced.
基于深度神经网络(DNN)的感知模型在自动驾驶车辆(AVs)的发展中不可或缺。然而,它们依赖于大规模高质量数据的事实被广泛认为是一种负担,因为获取和标注这些数据的成本非常高昂。此外,这个问题不仅是一次性的关注点,因为在不同地区部署自动驾驶汽车时可能需要新的数据集,而现有的数据集中没有涵盖该新地区的环境特征(即真实的目标域)。为了减轻这种负担,我们提出利用合成环境作为辅助领域,在其中重现现实领域的特性。这种方法可以以一种时间成本效益高的方式间接体验到实际目标区域的情况。 作为一种实用的演示方法,我们将nuScenes和韩国分别用作真实源域和真实目标域的代表。这意味着我们在韩国构建了数字孪生,并且复制了nuScenes的数据采集框架。将上述组件融入模拟器中使我们能够获得一个合成融合领域,在其中我们可以创建一个新的驾驶数据集MORDA(混合现实领域特征用于合成数据辅助领域适应)。为了验证MORDA提供的合成特性在学习韩国驾驶环境中的价值,我们在仅使用nuScenes和MORDA的组合训练2D/3D检测器之后,对其性能进行了评估,并将其应用于未见过的真实世界数据集(AI-Hub)中收集的数据。我们的实验表明,在AI-Hub数据集上,MORDA可以显著提高平均精度(mAP),而nuScenes上的表现则被保留或略有提升。 这种方法展示了合成环境如何能够在不增加大量成本的前提下改进自动驾驶系统的感知能力,尤其是在需要适应新地理区域的情况下。
https://arxiv.org/abs/2501.04950
In this paper, we aim to understand how user motivation shapes human-robot interaction (HRI) in the wild. To explore this, we conducted a field study by deploying a fully autonomous conversational robot in a shopping mall over two days. Through sequential video analysis, we identified five patterns of interaction fluency (Smooth, Awkward, Active, Messy, and Quiet), four types of user motivation for interacting with the robot (Function, Experiment, Curiosity, and Education), and user positioning towards the robot. We further analyzed how these motivations and positioning influence interaction fluency. Our findings suggest that incorporating users' motivation types into the design of robot behavior can enhance interaction fluency, engagement, and user satisfaction in real-world HRI scenarios.
在这篇论文中,我们的目标是理解用户动机如何塑造真实环境中的机器人与人类之间的互动(HRI)。为了探索这一问题,我们在一家购物中心部署了一台完全自主的对话机器人,并进行了为期两天的田野研究。通过连续视频分析,我们识别出了五种互动流畅性模式(顺畅、别扭、积极、混乱和安静),四种用户与机器人互动的动力类型(功能、实验、好奇和教育),以及用户对机器人的定位方式。进一步地,我们分析了这些动机和定位如何影响互动的流畅性。我们的研究发现表明,在设计机器人的行为时纳入用户的动力类型可以增强现实世界HRI场景中的互动流畅性、参与度和用户体验。
https://arxiv.org/abs/2501.04929
Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for tasks such as mapping, object discovery, and environmental assessment. Existing methods, such as frontier-based methods, rely heavily on 3D map operations, which are limited by map quality and often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose an image-only frontier-based exploration system, with FrontierNet as a core component developed in this work. FrontierNet is a learning-based model that (i) detects frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent exploration systems, achieving a 16% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments.
探索未知环境对于自主机器人来说至关重要,这使它们能够主动推理并决定获取哪些新的数据以完成如地图绘制、物体发现和环境评估等任务。现有的方法,例如基于边界的(frontier-based)方法,很大程度上依赖于3D地图操作,这些方法受限于地图的质量,并且往往会忽视来自视觉线索的宝贵信息。本工作旨在利用2D视觉线索进行高效的自主探索,以解决从3D地图中提取目标姿态时存在的限制。我们提出了一种仅基于图像的前沿探索系统,其中FrontierNet是本文开发的核心组件之一。 FrontierNet是一个学习型模型,它可以从由单目深度先验增强的姿态化RGB图像中(i)检测出前沿点和(ii)预测其信息增益。我们的方法为现有的依赖于3D地图的探索系统提供了一种替代方案,在广泛的模拟和真实世界实验中验证了在早期阶段探索效率提高了16%。
https://arxiv.org/abs/2501.04597
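For readers unfamiliar with the term, the classic map-based notion of a frontier that the abstract contrasts with can be sketched on a 2D occupancy grid: a frontier cell is a known-free cell next to unexplored space. This is the conventional baseline idea, not FrontierNet, which works from images; the cell encoding and the function name `find_frontiers` are assumptions.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # assumed cell encoding


def find_frontiers(grid):
    """Return (row, col) of every free cell with a 4-connected unknown neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbour is enough
    return frontiers
```

A map-based explorer would then rank these cells, for example by how much unknown area lies behind them, which is the role that predicted information gain plays in the image-based system described above.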
Uncertainty estimation is an indispensable capability for AI-enabled, safety-critical applications, e.g. autonomous vehicles or medical diagnosis. Bayesian neural networks (BNNs) use Bayesian statistics to provide both classification predictions and uncertainty estimation, but they suffer from high computational overhead associated with random number generation and repeated sample iterations. Furthermore, BNNs are not immediately amenable to acceleration through compute-in-memory architectures due to the frequent memory writes necessary after each RNG operation. To address these challenges, we present an ASIC that integrates 360 fJ/Sample Gaussian RNG directly into the SRAM memory words. This integration reduces RNG overhead and enables fully-parallel compute-in-memory operations for BNNs. The prototype chip achieves 5.12 GSa/s RNG throughput and 102 GOp/s neural network throughput while occupying 0.45 mm², bringing AI uncertainty estimation to edge computation.
不确定性估计是AI赋能的安全关键型应用(如自动驾驶汽车或医学诊断)中不可或缺的能力。贝叶斯神经网络(BNNs)使用贝叶斯统计方法来提供分类预测和不确定性评估,但它们会因为随机数生成和重复样本迭代而产生较高的计算开销。此外,由于每次随机数生成操作后都需要频繁的内存写入,因此BNN很难通过计算存储器架构进行加速。为了解决这些问题,我们提出了一种专用集成电路(ASIC),该电路将360 fJ/样本高斯随机数生成器直接集成到SRAM内存单元中。这种集成减少了随机数生成的开销,并使BNN能够在内存内并行执行计算操作。原型芯片实现了5.12 GSa/s的随机数生成吞吐量和102 GOp/s的神经网络处理速度,同时占用面积仅为0.45 mm²,将AI不确定性估计带到了边缘计算领域。
https://arxiv.org/abs/2501.04577
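The reason BNN inference leans so heavily on Gaussian random number generation can be seen in a minimal software sketch: every forward pass draws a fresh sample for every weight, and the spread of the resulting predictions is the uncertainty estimate. The single-neuron model and the names `bnn_predict`, `weight_means`, and `weight_stds` are assumptions for illustration and say nothing about the chip's architecture.

```python
import math
import random
import statistics


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bnn_predict(x, weight_means, weight_stds, n_samples=200, seed=0):
    """Monte Carlo prediction for a one-neuron Bayesian classifier.

    Each of the n_samples passes draws every weight from its own Gaussian,
    so the RNG is called n_samples * len(weights) times.
    Returns (mean probability, standard deviation across samples).
    """
    rng = random.Random(seed)
    probs = []
    for _ in range(n_samples):
        weights = [rng.gauss(m, s) for m, s in zip(weight_means, weight_stds)]
        probs.append(sigmoid(sum(w * xi for w, xi in zip(weights, x))))
    return statistics.fmean(probs), statistics.pstdev(probs)
```

With all `weight_stds` set to zero the model collapses to an ordinary deterministic neuron and the reported spread is zero; widening the weight distributions widens the spread of predictions for the same input.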
Endovascular navigation is a crucial aspect of minimally invasive procedures, where precise control of curvilinear instruments like guidewires is critical for successful interventions. A key challenge in this task is accurately predicting the evolving shape of the guidewire as it navigates through the vasculature, which presents complex deformations due to interactions with the vessel walls. Traditional segmentation methods often fail to provide accurate real-time shape predictions, limiting their effectiveness in highly dynamic environments. To address this, we propose SplineFormer, a new transformer-based architecture, designed specifically to predict the continuous, smooth shape of the guidewire in an explainable way. By leveraging the transformer's ability, our network effectively captures the intricate bending and twisting of the guidewire, representing it as a spline for greater accuracy and smoothness. We integrate our SplineFormer into an end-to-end robot navigation system by leveraging the condensed information. The experimental results demonstrate that our SplineFormer is able to perform endovascular navigation autonomously and achieves a 50% success rate when cannulating the brachiocephalic artery on the real robot.
血管内导航是微创手术中的关键环节,其中对如导丝这样的曲线器械的精确控制对于成功干预至关重要。在这个任务中面临的主要挑战之一就是准确预测导丝在通过血管时不断变化的形状,由于与血管壁的相互作用,这种情况下导丝会呈现出复杂的变形。传统的分割方法常常无法提供实时的、准确的形状预测,在动态环境中效果受限。 为了解决这一问题,我们提出了一种新的基于Transformer架构的方法——SplineFormer。这种方法专门设计用来以可解释的方式预测连续平滑的导丝形状。通过利用变压器的能力,我们的网络能够有效地捕捉到导丝弯曲和扭曲等复杂的变形情况,并将其表示成样条线,从而提高准确性和平滑度。 我们将SplineFormer整合进了一个端到端的机器人导航系统中,利用其集中后的信息进行操作。实验结果显示,我们的SplineFormer能够在真实的机器人上自主完成血管内导航任务,在尝试进入无名动脉时达到了50%的成功率。
https://arxiv.org/abs/2501.04515
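Representing the guidewire as a spline means a handful of control points define a continuous, smooth curve. The sketch below uses a uniform Catmull-Rom spline in 2D purely as an example of that idea; the abstract does not say which spline family SplineFormer predicts, and the function names are hypothetical.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Evaluate a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_spline(points, samples_per_segment=10):
    """Sample a smooth curve that passes through every control point."""
    # Repeat the end points so the first and last segments have four neighbours.
    padded = [points[0]] + list(points) + [points[-1]]
    curve = []
    for i in range(len(points) - 1):
        p0, p1, p2, p3 = padded[i:i + 4]
        for s in range(samples_per_segment):
            curve.append(catmull_rom_point(p0, p1, p2, p3, s / samples_per_segment))
    curve.append(tuple(float(v) for v in points[-1]))
    return curve
```

Four control points sampled at five points per segment yield sixteen curve points, which shows how compact the representation is compared with a per-pixel segmentation mask.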
The convergence of drone delivery systems, virtual worlds, and blockchain has transformed logistics and supply chain management, providing a fast and environmentally friendly alternative to traditional ground transportation. To provide users with a realistic experience, virtual service providers need to collect up-to-the-minute delivery information from edge devices. To address this challenge: 1) a reinforcement learning approach is introduced that gives drones fast training capabilities and the ability to adapt autonomously to new virtual scenarios for effective resource allocation; 2) a semantic communication framework for metaverses is proposed, which extracts semantic information to reduce communication cost and incentivize information transmission for metaverse services; 3) to ensure the security of user information, a lightweight authentication and key agreement scheme between the drone and the user is designed using blockchain technology. In our experiments, drone adaptation performance improves by about 35%, and the local offloading rate reaches 90% as the number of base stations increases. The proposed semantic communication system is compared with a Cross Entropy baseline model. With blockchain technology introduced, transaction throughput remains stable across different numbers of drones.
无人机交付系统、虚拟世界和区块链技术的融合已经改变了物流和供应链管理,提供了一种快速且环保的传统地面运输方法替代方案;为了向用户提供真实的体验,虚拟服务提供商需要从边缘设备收集实时的递送信息。为了解决这一挑战,提出了以下几项措施: 1. 通过引入强化学习方法来使无人机具备快速训练能力和自主适应新虚拟场景的能力,从而有效分配资源。 2. 提出了一个用于元宇宙的语义通信框架,该框架利用语义信息提取减少通信成本,并激励为元宇宙服务的信息传输。 3. 为了确保用户信息安全,通过引入区块链技术设计了一种轻量级的身份验证和密钥协商方案,以实现无人机与用户之间的安全连接。 在我们的实验中,无人机的适应性能提高了约35%,随着基站数量的增加,本地卸载率可达到90%。本文提出的语义通信系统与交叉熵基线模型进行了比较,并且引入区块链技术后,在不同数量的无人机情况下交易吞吐量保持在一个稳定的数值水平。
https://arxiv.org/abs/2501.04480
Intelligent Transportation Systems (ITS) are crucial for the development and operation of smart cities, addressing key challenges in efficiency, productivity, and environmental sustainability. This paper comprehensively reviews the transformative potential of Large Language Models (LLMs) in optimizing ITS. Initially, we provide an extensive overview of ITS, highlighting its components, operational principles, and overall effectiveness. We then delve into the theoretical background of various LLM techniques, such as GPT, T5, CTRL, and BERT, elucidating their relevance to ITS applications. Following this, we examine the wide-ranging applications of LLMs within ITS, including traffic flow prediction, vehicle detection and classification, autonomous driving, traffic sign recognition, and pedestrian detection. Our analysis reveals how these advanced models can significantly enhance traffic management and safety. Finally, we explore the challenges and limitations LLMs face in ITS, such as data availability, computational constraints, and ethical considerations. We also present several future research directions and potential innovations to address these challenges. This paper aims to guide researchers and practitioners through the complexities and opportunities of integrating LLMs in ITS, offering a roadmap to create more efficient, sustainable, and responsive next-generation transportation systems.
智能交通系统(ITS)对于智慧城市的开发和运营至关重要,能够解决效率、生产力以及环境可持续性等方面的关键挑战。本文全面回顾了大型语言模型(LLM)在优化ITS中的变革潜力。首先,我们提供了对ITS的详尽概述,强调其组成部分、运作原理及其总体有效性。然后深入探讨各种LLM技术如GPT、T5、CTRL和BERT的理论背景,并阐明这些技术对于ITS应用的相关性。接下来,本文研究了LLMs在ITS中的广泛应用场景,包括交通流量预测、车辆检测与分类、自动驾驶、交通标志识别以及行人检测等。我们的分析揭示了这些高级模型如何显著提升交通管理和安全性。最后,我们探讨了LLM在ITS中面临的挑战和限制因素,如数据可用性、计算约束和伦理考量,并提出了应对这些挑战的未来研究方向和潜在创新方法。 本文旨在为研究人员和实践者提供指引,在整合LLMs到ITS时帮助他们了解复杂性和机遇,绘制出创建更高效、可持续且响应迅速的下一代交通系统的路线图。
https://arxiv.org/abs/2501.04437
Multimodal 3D object detection has garnered considerable interest in autonomous driving. However, multimodal detectors suffer from dimension mismatches that derive from fusing 3D points with 2D pixels coarsely, which leads to sub-optimal fusion performance. In this paper, we propose a multimodal framework FGU3R to tackle the issue mentioned above via unified 3D representation and fine-grained fusion, which consists of two important components. First, we propose an efficient feature extractor for raw and pseudo points, termed Pseudo-Raw Convolution (PRConv), which modulates multimodal features synchronously and aggregates the features from different types of points on key points based on multimodal interaction. Second, a Cross-Attention Adaptive Fusion (CAAF) is designed to fuse homogeneous 3D RoI (Region of Interest) features adaptively via a cross-attention variant in a fine-grained manner. Together they make fine-grained fusion on unified 3D representation. The experiments conducted on the KITTI and nuScenes show the effectiveness of our proposed method.
在自动驾驶领域,多模态3D物体检测引起了广泛的关注。然而,现有的多模态检测器在融合三维点和二维像素时存在维度不匹配的问题,这些问题源于粗略的融合方式,导致了次优的融合效果。为此,在本文中我们提出了一种新的多模态框架FGU3R(Fine-Grained Unified 3D Representation),旨在通过统一的3D表示和细粒度融合来解决上述问题。该框架主要包括两个重要组成部分: 1. 我们提出了一个高效的特征提取器,称为伪原始卷积(PRConv),用于处理原始点和伪点数据。PRConv能够同步调节多模态特征,并基于多模态交互在关键点上聚合不同类型的点的特征。 2. 设计了一种跨注意力自适应融合方法(CAAF),通过跨注意力变体以细粒度的方式来适应性地融合同质3D感兴趣区域(RoI)特征。 这两个组成部分共同实现了统一的3D表示上的细粒度融合。在KITTI和nuScenes数据集上进行的实验表明了我们提出的方法的有效性。
https://arxiv.org/abs/2501.04373
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multimodal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multimodal video tasks in autonomous driving; for example, in risk object detection it outperforms the previous SOTA method with a 5.5% mIoU improvement.
随着多模态大型语言模型(MLLMs)的普及,自动驾驶领域迎来了新的机遇与挑战。特别是多模态视频理解对于交互式分析自动驾驶过程中可能发生的情况至关重要。然而,在这种动态场景中的视频往往包含复杂的时空运动模式,这限制了现有MLLMs在此领域的泛化能力。为解决这一问题,我们提出了一种新颖的分层蟒蛇自适应(Hierarchical Mamba Adaptation, H-MBA)框架,旨在适应自动驾驶视频中复杂的运动变化。 具体来说,H-MBA由两个不同的模块组成:上下文蟒蛇(Context Mamba, C-Mamba)和查询蟒蛇(Query Mamba, Q-Mamba)。首先,C-Mamba包含了各种类型的结构状态空间模型,能够有效地捕捉不同时间分辨率下的多粒度视频背景信息。其次,Q-Mamba可以灵活地将当前帧转换为可学习的查询,并且有选择性地从视频上下文中挑选出多个粒度的信息融入查询中。因此,它可以自适应地整合所有多尺度时间分辨率下的视频背景信息,从而增强视频理解能力。 通过在MLLMs中的即插即用范式,我们的H-MBA框架在自动驾驶的多模态视频任务上展示了出色的表现,例如,在风险物体检测方面,相较于之前的最先进方法(SOTA),实现了5.5% mIoU的性能提升。
https://arxiv.org/abs/2501.04302
Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) the generated machine learning code achieves state-of-the-art performance compared to existing methods; (3) human involvement, in the form of feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.
历史上,科学研究是一个漫长且耗费成本的过程,从最初的构想到最终结果的产生都需要大量的时间和资源。为了加速科学发现、减少研究成本并提高研究质量,我们引入了Agent Laboratory,这是一个基于大型语言模型(LLM)的自主框架,能够完成整个研究过程。该框架接受人类提供的研究理念,并通过文献综述、实验和报告撰写三个阶段来生成全面的研究成果,包括代码仓库和研究报告,并允许用户在每个阶段提供反馈和指导。 我们使用多种最先进的LLM部署了Agent Laboratory,并邀请多位研究人员参与评估其质量:他们通过填写调查问卷、为研究过程提供人类反馈以及最终评审论文来进行评价。我们的发现如下: 1. 由o1-preview驱动的Agent Laboratory产生最佳的研究成果; 2. 自动生成的机器学习代码能够达到与现有方法相比的最优性能; 3. 在每个阶段都进行的人类参与和反馈显著提升了整体研究质量; 4. Agent Laboratory大幅减少了研究费用,相比于之前的自主研究方法,成本降低了84%。 我们希望Agent Laboratory能够让研究人员将更多的精力投入到富有创意的想法构思上,而不是低级编码和写作工作中,从而加速科学发现。
https://arxiv.org/abs/2501.04227