Prior parameter distributions provide an elegant way to represent prior expert and world knowledge for informed learning. Previous work has shown that using such informative priors to regularize probabilistic deep learning (DL) models increases their performance and data-efficiency. However, commonly used sampling-based approximations for probabilistic DL models can be computationally expensive, requiring multiple inference passes and longer training times. Promising alternatives are compute-efficient last layer kernel approximations like spectral normalized Gaussian processes (SNGPs). We propose a novel regularization-based continual learning method for SNGPs, which enables the use of informative priors that represent prior knowledge learned from previous tasks. Our proposal builds upon well-established methods and requires no rehearsal memory or parameter expansion. We apply our informed SNGP model to the trajectory prediction problem in autonomous driving by integrating prior drivability knowledge. On two public datasets, we investigate its performance under diminishing training data and across locations, and thereby demonstrate an increase in data-efficiency and robustness to location-transfers over non-informed and informed baselines.
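SNGP's compute efficiency comes from a deterministic backbone whose hidden layers are spectrally normalized (bounding the Lipschitz constant so features stay distance-aware) plus a GP-style last layer. As a hedged illustration of the normalization step only (not the paper's code; the function names and the bound `c` are illustrative), in numpy:

```python
import numpy as np

def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W via power iteration."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

def spectral_normalize(W, c=0.95):
    """Rescale W so its spectral norm is at most c (the Lipschitz bound)."""
    s = spectral_norm(W)
    return W * min(1.0, c / s)

W = np.random.default_rng(1).normal(size=(8, 8))  # a dense layer's weights
W_sn = spectral_normalize(W)
```

Stacking such layers keeps the feature map approximately distance-preserving, which is what lets the last-layer GP's distance-based uncertainty remain meaningful.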
https://arxiv.org/abs/2403.11966
This paper presents a novel approach to address the challenging problem of autonomous on-ramp merging, where a self-driving vehicle needs to seamlessly integrate into a flow of vehicles on a multi-lane highway. We introduce the Lane-keeping, Lane-changing with Latent-state Inference and Safety Controller (L3IS) agent, designed to perform the on-ramp merging task safely without comprehensive knowledge about surrounding vehicles' intents or driving styles. We also present an augmentation of this agent called AL3IS that accounts for observation delays, allowing the agent to make more robust decisions in real-world environments with vehicle-to-vehicle (V2V) communication delays. By modeling the unobservable aspects of the environment through latent states, such as other drivers' intents, our approach enhances the agent's ability to adapt to dynamic traffic conditions, optimize merging maneuvers, and ensure safe interactions with other vehicles. We demonstrate the effectiveness of our method through extensive simulations generated from real traffic data and compare its performance with existing approaches. L3IS shows a 99.90\% success rate in a challenging on-ramp merging case generated from the real US Highway 101 data. We further perform a sensitivity analysis on AL3IS to evaluate its robustness against varying observation delays, which demonstrates an acceptable performance of 93.84\% success rate in 1-second V2V communication delay.
https://arxiv.org/abs/2403.11852
Autonomous driving systems are a rapidly evolving technology that enables driverless car production. Trajectory prediction is a critical component of autonomous driving systems, enabling cars to anticipate the movements of surrounding objects for safe navigation. Trajectory prediction from lidar point-cloud data outperforms prediction from 2D images because point clouds provide 3D information. However, processing point-cloud data is more complicated and time-consuming than processing 2D images; hence, state-of-the-art 3D trajectory prediction methods using point-cloud data suffer from slow and erroneous predictions. This paper introduces TrajectoryNAS, a pioneering method that focuses on utilizing point-cloud data for trajectory prediction. By leveraging Neural Architecture Search (NAS), TrajectoryNAS automates the design of trajectory prediction models, encompassing object detection, tracking, and forecasting in a cohesive manner. This approach not only addresses the complex interdependencies among these tasks but also emphasizes the importance of accuracy and efficiency in trajectory modeling. Through empirical studies, TrajectoryNAS demonstrates its effectiveness in enhancing the performance of autonomous driving systems, marking a significant advancement in the field. Experimental results reveal that TrajectoryNAS yields at least 4.8 points higher accuracy and 1.1× lower latency than competing methods on the nuScenes dataset.
https://arxiv.org/abs/2403.11695
The ability to predict the future trajectories of traffic participants is crucial for the safe and efficient operation of autonomous vehicles. In this paper, a diffusion-based generative model for multi-agent trajectory prediction is proposed. The model is capable of capturing the complex interactions between traffic participants and the environment, accurately learning the multimodal nature of the data. The effectiveness of the approach is assessed on large-scale datasets of real-world traffic scenarios, showing that our model outperforms several well-established methods in terms of prediction accuracy. By incorporating differential motion constraints on the model output, we illustrate that our model is capable of generating a diverse set of realistic future trajectories. Through the use of an interaction-aware guidance signal, we further demonstrate that the model can be adapted to predict the behavior of less cooperative agents, emphasizing its practical applicability under uncertain traffic conditions.
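Diffusion-based prediction learns to invert a fixed forward noising process applied to ground-truth trajectories. A minimal sketch of that closed-form forward process (the schedule values here are illustrative, not the paper's):

```python
import numpy as np

# Illustrative variance schedule; the paper's actual schedule may differ.
T = 100
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_s (1 - beta_s)

def diffuse(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for a trajectory array x0."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = np.stack([np.linspace(0.0, 10.0, 20), np.zeros(20)], axis=-1)  # straight 2D path
x_mid, _ = diffuse(x0, 10, rng)     # early step: still close to x0
x_end, _ = diffuse(x0, T - 1, rng)  # final step: nearly pure noise
```

A denoising network trained to predict `eps` from `x_t` can then be run in reverse, starting from noise, to sample diverse future trajectories.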
https://arxiv.org/abs/2403.11643
Monocular depth estimation (MDE) has advanced significantly, primarily through the integration of convolutional neural networks (CNNs) and more recently, Transformers. However, concerns about their susceptibility to adversarial attacks have emerged, especially in safety-critical domains like autonomous driving and robotic navigation. Existing approaches for assessing CNN-based depth prediction methods have fallen short in inducing comprehensive disruptions to the vision system, often limited to specific local areas. In this paper, we introduce SSAP (Shape-Sensitive Adversarial Patch), a novel approach designed to comprehensively disrupt monocular depth estimation (MDE) in autonomous navigation applications. Our patch is crafted to selectively undermine MDE in two distinct ways: by distorting estimated distances or by creating the illusion of an object disappearing from the system's perspective. Notably, our patch is shape-sensitive, meaning it considers the specific shape and scale of the target object, thereby extending its influence beyond immediate proximity. Furthermore, our patch is trained to effectively address different scales and distances from the camera. Experimental results demonstrate that our approach induces a mean depth estimation error surpassing 0.5, impacting up to 99% of the targeted region for CNN-based MDE models. Additionally, we investigate the vulnerability of Transformer-based MDE models to patch-based attacks, revealing that SSAP yields a significant error of 0.59 and exerts substantial influence over 99% of the target region on these models.
https://arxiv.org/abs/2403.11515
Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), featuring a wide range of sensing modalities, high-accuracy ground truth, and diverse challenging environments across three Eurasian university campuses. MCD comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce semantic annotations of 29 classes over 59k sparse NRE lidar scans across three domains, thus providing a novel challenge to existing semantic segmentation research upon this largely unexplored lidar modality. Finally, we propose, for the first time to the best of our knowledge, continuous-time ground truth based on optimization-based registration of lidar-inertial data on large survey-grade prior maps, which are also publicly released, each several times the size of existing ones. We conduct a rigorous evaluation of numerous state-of-the-art algorithms on MCD, report their performance, and highlight the challenges awaiting solutions from the research community.
https://arxiv.org/abs/2403.11496
Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. Context information, such as road maps and surrounding agents' states, provides crucial geometric and semantic information for motion behavior prediction. To this end, recent works explore two-stage prediction frameworks where coarse trajectories are first proposed, and then used to select critical context information for trajectory refinement. However, they either incur a large amount of computation or bring limited improvement, if not both. In this paper, we introduce a novel scenario-adaptive refinement strategy, named SmartRefine, to refine prediction with minimal additional computation. Specifically, SmartRefine can comprehensively adapt refinement configurations based on each scenario's properties, and smartly chooses the number of refinement iterations by introducing a quality score to measure the prediction quality and remaining refinement potential of each scenario. SmartRefine is designed as a generic and flexible approach that can be seamlessly integrated into most state-of-the-art motion prediction models. Experiments on Argoverse (1 & 2) show that our method consistently improves the prediction accuracy of multiple state-of-the-art prediction models. Specifically, by adding SmartRefine to QCNet, we outperform all published ensemble-free works on the Argoverse 2 leaderboard (single agent track) at submission. Comprehensive studies are also conducted to ablate design choices and explore the mechanism behind multi-iteration refinement. Codes are available at this https URL
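The idea of spending refinement compute only where it pays off can be made concrete with a toy loop (entirely schematic; the quality score and refinement operator below are stand-ins, not SmartRefine's):

```python
import numpy as np

def refine_until_converged(pred, target, max_iters=10, tol=1e-3):
    """Toy adaptive refinement: each iteration nudges the prediction toward a
    proxy target; a quality score decides whether another iteration is worth it."""
    used = 0
    for _ in range(max_iters):
        remaining_potential = np.linalg.norm(target - pred)  # stand-in quality score
        if remaining_potential < tol:
            break  # prediction already good; save the compute
        pred = pred + 0.5 * (target - pred)  # stand-in refinement step
        used += 1
    return pred, used

easy = np.zeros(2)           # scenario already near its target: no refinement needed
hard = np.array([4.0, 3.0])  # scenario far from its target: many iterations used
_, iters_easy = refine_until_converged(easy, np.zeros(2))
_, iters_hard = refine_until_converged(hard, np.zeros(2))
```

The point of the sketch is only the control flow: iteration count adapts per scenario instead of being fixed.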
https://arxiv.org/abs/2403.11492
With the advent of universal function approximators in the domain of reinforcement learning, the number of practical applications leveraging deep reinforcement learning (DRL) has exploded. Decision-making in automated driving tasks has emerged as a chief application among them, taking the sensor data or the higher-order kinematic variables as the input and providing a discrete choice or continuous control output. However, the black-box nature of the models presents an overwhelming limitation that restricts the real-world deployment of DRL in autonomous vehicles (AVs). Therefore, in this research work, we focus on the interpretability of an attention-based DRL framework. We use a continuous proximal policy optimization-based DRL algorithm as the baseline model and add a multi-head attention framework in an open-source AV simulation environment. We provide some analytical techniques for discussing the interpretability of the trained models in terms of explainability and causality for spatial and temporal correlations. We show that the weights in the first head encode the positions of the neighboring vehicles while the second head focuses on the leader vehicle exclusively. Also, the ego vehicle's action is causally dependent on the vehicles in the target lane spatially and temporally. Through these findings, we reliably show that these techniques can help practitioners decipher the results of the DRL algorithms.
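The head-level analysis described above reduces to reading off per-head attention weights. A generic numpy sketch of scaled dot-product attention (shapes, names, and the toy vehicle encoding are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K):
    """Scaled dot-product attention weights softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d))

# Toy setting: one ego query attending over three surrounding-vehicle keys.
K = np.array([[1.0, 0.0, 0.0, 0.0],   # vehicle 0
              [0.0, 1.0, 0.0, 0.0],   # vehicle 1 (say, the leader)
              [0.0, 0.0, 1.0, 0.0]])  # vehicle 2
Q = np.array([[0.0, 2.0, 0.0, 0.0]])  # this head's query aligns with vehicle 1
A = attention_weights(Q, K)
```

Inspecting which key each head's weight mass concentrates on is exactly the kind of evidence behind statements like "the second head focuses on the leader vehicle".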
https://arxiv.org/abs/2403.11432
As the field of AI continues to evolve, a significant dimension of this progression is the development of Large Language Models and their potential to enhance multi-agent artificial intelligence systems. This paper explores the cooperative capabilities of Large Language Model-augmented Autonomous Agents (LAAs) using the well-known Melting Pot environments along with reference models such as GPT-4 and GPT-3.5. Preliminary results suggest that while these agents demonstrate a propensity for cooperation, they still struggle with effective collaboration in given environments, emphasizing the need for more robust architectures. The study's contributions include an abstraction layer to adapt Melting Pot game scenarios for LLMs; the implementation of a reusable architecture for LLM-mediated agent development, which includes short- and long-term memories and different cognitive modules; and the evaluation of cooperation capabilities using a set of metrics tied to Melting Pot's "Commons Harvest" game. The paper closes by discussing the limitations of the current architectural framework and the potential of a new set of modules that fosters better cooperation among LAAs.
https://arxiv.org/abs/2403.11381
Recently, LLM-powered driver agents have demonstrated considerable potential in the field of autonomous driving, showcasing human-like reasoning and decision-making abilities. However, current research on aligning driver agent behaviors with human driving styles remains limited, partly due to the scarcity of high-quality natural language data from human driving. To address this research gap, we propose a multi-alignment framework designed to align driver agents with human driving styles through demonstrations and feedback. Notably, we construct a natural language dataset of human driver behaviors through naturalistic driving experiments and post-driving interviews, offering high-quality human demonstrations for LLM alignment. The framework's effectiveness is validated through simulation experiments in the CARLA urban traffic simulator and further corroborated by human evaluations. Our research offers valuable insights into designing driving agents with diverse driving styles. The implementation of the framework and details of the dataset can be found at the link.
https://arxiv.org/abs/2403.11368
Motion prediction is among the most fundamental tasks in autonomous driving. Traditional methods of motion forecasting primarily encode vector information of maps and historical trajectory data of traffic participants, lacking a comprehensive understanding of overall traffic semantics, which in turn affects the performance of prediction tasks. In this paper, we utilized Large Language Models (LLMs) to enhance the global traffic context understanding for motion prediction tasks. We first conducted systematic prompt engineering, visualizing complex traffic environments and historical trajectory information of traffic participants into image prompts -- Transportation Context Map (TC-Map), accompanied by corresponding text prompts. Through this approach, we obtained rich traffic context information from the LLM. By integrating this information into the motion prediction model, we demonstrate that such context can enhance the accuracy of motion predictions. Furthermore, considering the cost associated with LLMs, we propose a cost-effective deployment strategy: enhancing the accuracy of motion prediction tasks at scale with 0.7\% LLM-augmented datasets. Our research offers valuable insights into enhancing the understanding of traffic scenes of LLMs and the motion prediction performance of autonomous driving.
https://arxiv.org/abs/2403.11057
Artificial General Intelligence falls short when communicating role-specific nuances to other systems. This is more pronounced when building autonomous LLM agents capable of, and designed for, communicating with each other for real-world problem solving. Humans can communicate context and domain-specific nuances along with knowledge, and that has led to refinement of skills. In this work we propose and evaluate a novel method that leads to knowledge distillation among LLM agents, enabling real-time human role-play that preserves unique contexts without relying on any stored data or pretraining. We also evaluate how our system performs better in simulated real-world tasks compared to the state of the art.
https://arxiv.org/abs/2403.10824
Autonomous vehicles are gradually entering city roads today, with the help of high-definition maps (HDMaps). However, the reliance on HDMaps prevents autonomous vehicles from stepping into regions without this expensive digital infrastructure. This fact drives many researchers to study online HDMap generation algorithms, but the performance of these algorithms at far regions is still unsatisfying. We present P-MapNet, in which the letter P highlights the fact that we focus on incorporating map priors to improve model performance. Specifically, we exploit priors in both SDMap and HDMap. On one hand, we extract weakly aligned SDMap from OpenStreetMap, and encode it as an additional conditioning branch. Despite the misalignment challenge, our attention-based architecture adaptively attends to relevant SDMap skeletons and significantly improves performance. On the other hand, we exploit a masked autoencoder to capture the prior distribution of HDMap, which can serve as a refinement module to mitigate occlusions and artifacts. We benchmark on the nuScenes and Argoverse2 datasets. Through comprehensive experiments, we show that: (1) our SDMap prior can improve online map generation performance, using both rasterized (by up to $+18.73$ $\rm mIoU$) and vectorized (by up to $+8.50$ $\rm mAP$) output representations. (2) our HDMap prior can improve map perceptual metrics by up to $6.34\%$. (3) P-MapNet can be switched into different inference modes that covers different regions of the accuracy-efficiency trade-off landscape. (4) P-MapNet is a far-seeing solution that brings larger improvements on longer ranges. Codes and models are publicly available at this https URL.
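The HDMap-prior module rests on the standard masked-autoencoder recipe: hide a fraction of map tokens and train a decoder to restore them, so the model internalizes what plausible HD-map structure looks like. A sketch of the masking step only (token shapes, the mask ratio, and the zero placeholder are illustrative assumptions, not P-MapNet's code):

```python
import numpy as np

def random_mask(tokens, ratio, rng):
    """Zero out a random fraction of map tokens; an autoencoder trained to
    restore them learns a prior over plausible map structure."""
    n = tokens.shape[0]
    n_mask = int(round(ratio * n))
    idx = rng.permutation(n)[:n_mask]
    corrupted = tokens.copy()
    corrupted[idx] = 0.0  # mask-token placeholder
    return corrupted, idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 map tokens, 8-dim features
corrupted, idx = random_mask(tokens, 0.25, rng)
```

At inference time the same restoration network can be applied to an occluded or artifact-ridden map estimate as a refinement module.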
https://arxiv.org/abs/2403.10521
Integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models' cognitive capabilities and their understanding of physical embodiment leads to the following question: can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots without reliance on external sensing or pre-programmed knowledge about their structure. Here, we propose a meta self-modeling approach that can deduce robot morphology through proprioception (the internal sense of position and movement). Our study introduces a 12-DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance robots' understanding of their physical embodiment and adaptability in real-world scenarios.
https://arxiv.org/abs/2403.10496
The surge in black-box AI models has prompted the need to explain the internal mechanism and justify their reliability, especially in high-stakes applications, such as healthcare and autonomous driving. Due to the lack of a rigorous definition of explainable AI (XAI), a plethora of research related to explainability, interpretability, and transparency has been developed to explain and analyze the model from various perspectives. Consequently, with an exhaustive list of papers, it becomes challenging to have a comprehensive overview of XAI research from all aspects. Considering the popularity of neural networks in AI research, we narrow our focus to a specific area of XAI research: gradient based explanations, which can be directly adopted for neural network models. In this review, we systematically explore gradient based explanation methods to date and introduce a novel taxonomy to categorize them into four distinct classes. Then, we present the essence of technique details in chronological order and underscore the evolution of algorithms. Next, we introduce both human and quantitative evaluations to measure algorithm performance. More importantly, we demonstrate the general challenges in XAI and specific challenges in gradient based explanations. We hope that this survey can help researchers understand state-of-the-art progress and their corresponding disadvantages, which could spark their interest in addressing these issues in future work.
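The simplest member of the gradient-based family the survey covers is the vanilla input gradient: differentiate the model output with respect to the input and rank features by gradient magnitude. A self-contained sketch for a logistic model where the gradient is available analytically (the weights and input are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(w, b, x):
    """Vanilla gradient explanation for f(x) = sigmoid(w.x + b):
    df/dx = f(1 - f) * w, computed analytically; |df/dx| ranks feature relevance."""
    f = sigmoid(w @ x + b)
    return f * (1.0 - f) * w

w = np.array([3.0, -1.0, 0.1])  # model weights (illustrative)
x = np.array([0.2, 0.5, 0.9])   # input to explain
s = input_gradient(w, 0.0, x)
```

Later methods in the taxonomy (e.g., integrated or smoothed gradients) refine this same quantity to address saturation and noise.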
https://arxiv.org/abs/2403.10415
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers between branches of different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
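The multi-objective selection the search performs can be made concrete with a small helper (the candidate numbers are purely illustrative, and `pareto_front` is not from the paper) that keeps exactly the architectures no other candidate dominates on both latency and mIoU:

```python
def pareto_front(points):
    """Indices of candidates on the Pareto frontier when minimizing latency
    (first coordinate) and maximizing mIoU (second coordinate)."""
    front = []
    for i, (lat_i, miou_i) in enumerate(points):
        dominated = any(
            lat_j <= lat_i and miou_j >= miou_i and (lat_j < lat_i or miou_j > miou_i)
            for j, (lat_j, miou_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# (latency ms, mIoU) for four hypothetical architectures found by the search
candidates = [(10.0, 0.70), (12.0, 0.72), (11.0, 0.69), (20.0, 0.72)]
front = pareto_front(candidates)
```

Candidate 2 is dominated by candidate 0 (slower and less accurate) and candidate 3 by candidate 1, so only the first two survive on the frontier.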
https://arxiv.org/abs/2403.10413
Accurate positioning of underwater robots in confined environments is crucial for inspection and mapping tasks and is also a prerequisite for autonomous operations. Presently, there are no positioning systems available that are suited for real-world use in confined underwater environments, are unconstrained by environmental lighting and water turbidity levels, and have sufficient accuracy for reliable and repeatable navigation. This shortage presents a significant barrier to enhancing the capabilities of ROVs in such scenarios. This paper introduces an innovative positioning system for ROVs operating in confined, cluttered underwater settings, achieved through the collaboration of an omnidirectional surface vehicle and an ROV. A formulation is proposed and evaluated in simulation against ground truth. The experimental results from the simulation form a proof of principle of the proposed system and also demonstrate its deployability. Unlike many previous approaches, the system does not rely on fixed infrastructure or tracking of features in the environment, and it can cover large enclosed areas without additional equipment.
https://arxiv.org/abs/2403.10397
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: this https URL.
https://arxiv.org/abs/2403.10353
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a streaming of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by a joint learning of a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
https://arxiv.org/abs/2403.10245
In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and sophisticated reasoning. This development heralds a new era of scalability and human-like adaptability in goal attainment. In this context, we introduce AUTONODE (Autonomous User-interface Transformation through Online Neuro-graphic Operations and Deep Exploration). AUTONODE employs advanced neuro-graphical techniques to facilitate autonomous navigation and task execution on web interfaces, thereby obviating the necessity for predefined scripts or manual intervention. Our engine empowers agents to comprehend and implement complex workflows, adapting to dynamic web environments with unparalleled efficiency. Our methodology synergizes cognitive functionalities with robotic automation, endowing AUTONODE with the ability to learn from experience. We have integrated an exploratory module, DoRA (Discovery and mapping Operation for graph Retrieval Agent), which is instrumental in constructing a knowledge graph that the engine utilizes to optimize its actions and achieve objectives with minimal supervision. The versatility and efficacy of AUTONODE are demonstrated through a series of experiments, highlighting its proficiency in managing a diverse array of web-based tasks, ranging from data extraction to transaction processing.
https://arxiv.org/abs/2403.10171