Deep learning-based Autonomous Driving (AD) models often exhibit poor generalization due to data heterogeneity in ever-shifting driving domains. While Federated Learning (FL) can improve the generalization of an AD model (known as a FedAD system), conventional small models often struggle with under-fitting as the amount of accumulated training data progressively increases. To address this issue, employing Large Vision Models (LVMs) in FedAD instead of conventional small models is a viable option for better learning of representations from a vast volume of data. However, implementing LVMs in FedAD introduces three challenges: (I) the extremely high communication overhead of transmitting LVMs between participating vehicles and a central server; (II) the lack of computing resources to deploy LVMs on each vehicle; (III) the performance drop caused by the LVM focusing on shared features while overlooking local vehicle characteristics. To overcome these challenges, we propose pFedLVM, an LVM-driven, latent-feature-based personalized Federated Learning framework. In this approach, the LVM is deployed only on the central server, which effectively alleviates the computational burden on individual vehicles. Furthermore, what is exchanged between the central server and the vehicles is the learned features rather than the LVM parameters, which significantly reduces communication overhead. In addition, we utilize both the features shared by all participating vehicles and the individual characteristics of each vehicle to establish a personalized learning mechanism. This enables each vehicle's model to learn features from others while preserving its personalized characteristics, thereby outperforming globally shared models trained in general FL. Extensive experiments demonstrate that pFedLVM outperforms existing state-of-the-art approaches.
基于深度学习的自动驾驶(AD)模型常因领域不断变化环境下的数据异质性而泛化性能较差。虽然联邦学习(FL)可以提升AD模型的泛化能力(即FedAD系统),但随着累积训练数据量逐渐增加,传统小模型往往出现欠拟合。为了解决这一问题,放弃传统小模型,转而在FedAD中采用大型视觉模型(LVM),是从海量数据中更好地学习表示的可行途径。然而,在FedAD中引入LVM面临三个挑战:(I)在参与车辆与中央服务器之间传输LVM带来的极高通信开销;(II)各车辆缺乏部署LVM所需的计算资源;(III)LVM过于关注共享特征而忽视本地车辆特性,导致性能下降。为克服这些挑战,我们提出了pFedLVM,一种LVM驱动、基于潜在特征的个性化联邦学习框架。在该方法中,LVM仅部署在中央服务器上,有效减轻了各车辆的计算负担。此外,中央服务器与车辆之间交换的是学习到的特征而非LVM参数,从而显著降低通信开销。我们还同时利用所有参与车辆的共享特征与各车辆的个性化特征来建立个性化学习机制,使每辆车的模型既能学习其他车辆的特征,又能保留自身的个性化特性,从而优于一般FL中训练的全局共享模型。大量实验表明,pFedLVM优于现有的最先进方法。
https://arxiv.org/abs/2405.04146
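The communication argument in the abstract above can be made concrete with a back-of-the-envelope sketch. This is an illustration only, not the paper's implementation: the model size, feature dimension, and per-round sample count below are hypothetical placeholders chosen to show why shipping latent features instead of LVM parameters shrinks the per-round payload.

```python
# Illustrative sketch: per-round communication of conventional FL (ship the
# whole LVM) vs. pFedLVM-style feature exchange. All sizes are assumptions.

LVM_PARAMS = 300_000_000      # assumed LVM size (a ViT-L-scale backbone)
BYTES_PER_FLOAT = 4
FEATURE_DIM = 1024            # assumed latent feature dimension
SAMPLES_PER_ROUND = 2_000     # frames one vehicle contributes per round

def payload_params() -> int:
    """Bytes to ship the full LVM to one vehicle (conventional FL)."""
    return LVM_PARAMS * BYTES_PER_FLOAT

def payload_features() -> int:
    """Bytes to ship one round of latent features instead."""
    return SAMPLES_PER_ROUND * FEATURE_DIM * BYTES_PER_FLOAT

print(f"parameter exchange : {payload_params() / 1e9:.2f} GB")
print(f"feature exchange   : {payload_features() / 1e6:.2f} MB")
print(f"reduction factor   : {payload_params() / payload_features():.0f}x")
```

Under these assumed numbers the feature payload is two orders of magnitude smaller; the real ratio depends on the actual LVM and batch sizes.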
Emergent-scene safety is a key milestone for fully autonomous driving, and reliable on-time prediction is essential to maintaining safety in emergency scenarios. However, these emergency scenarios are long-tailed and hard to collect, which prevents the system from obtaining reliable predictions. In this paper, we build a new dataset targeting long-term prediction of emergency events from inconspicuous state variations in the history, a task we name the Extro-Spective Prediction (ESP) problem. Based on the proposed dataset, a flexible feature encoder for ESP is introduced to various prediction methods as a seamless plug-in, and its consistent performance improvement underscores its efficacy. Furthermore, a new metric named clamped temporal error (CTE) is proposed to give a more comprehensive evaluation of prediction performance, especially for time-sensitive emergency events at sub-second scale. Interestingly, as our ESP features can naturally be described in human-readable language, integrating them into ChatGPT also shows huge potential. The ESP-dataset and all benchmarks are released at this https URL.
紧急场景安全是实现完全自动驾驶的关键里程碑,而可靠的及时预测对于在紧急场景中保障安全至关重要。然而,这类紧急场景呈长尾分布且难以收集,限制了系统获得可靠的预测。在本文中,我们构建了一个新数据集,旨在利用历史中不明显的状态变化对紧急事件进行长期预测,并将该任务命名为Extro-Spective Prediction(ESP)问题。基于所提出的数据集,我们引入了一个灵活的ESP特征编码器,可作为无缝插件用于各种预测方法,其一致的性能提升验证了其有效性。此外,我们还提出了一个名为clamped temporal error(CTE)的新指标,以更全面地评估预测性能,尤其是在亚秒级的时间敏感紧急事件中。有趣的是,由于我们的ESP特征天然可以用人类可读的语言描述,将其集成到ChatGPT中的应用也显示出巨大潜力。ESP数据集和所有基准均已在原文链接处发布。
https://arxiv.org/abs/2405.04100
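The abstract above names the CTE metric but does not give its formula; a plausible minimal form, sketched here as an assumption rather than the paper's definition, clamps each event's absolute timing error at a threshold so that one badly missed long-tail event cannot dominate the average.

```python
# Hedged sketch of a clamped temporal error (CTE)-style metric. The clamp
# value and the exact aggregation are illustrative assumptions.

def clamped_temporal_error(t_pred: float, t_true: float, clamp: float = 1.0) -> float:
    """Absolute prediction-time error in seconds, clamped at `clamp`."""
    return min(abs(t_pred - t_true), clamp)

def mean_cte(pairs, clamp: float = 1.0) -> float:
    """Average CTE over (t_pred, t_true) pairs for a set of emergency events."""
    return sum(clamped_temporal_error(p, t, clamp) for p, t in pairs) / len(pairs)

events = [(3.2, 3.0), (5.0, 4.9), (9.0, 4.0)]   # last event missed by 5 s
print(mean_cte(events))                          # the big miss contributes only `clamp`
```

The clamp is what makes such a metric informative for sub-second events: small errors are compared on their actual scale, while gross misses saturate instead of swamping the mean.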
Autonomous driving perception models are typically composed of multiple functional modules that interact through complex relationships to accomplish environment understanding. However, perception models are predominantly optimized as a black box through end-to-end training, lacking independent evaluation of functional modules, which poses difficulties for interpretability and optimization. Pioneering on this issue, we propose an evaluation method based on feature map analysis to gauge the convergence of the model, thereby assessing functional modules' training maturity. We construct a quantitative metric named the Feature Map Convergence Score (FMCS) and develop a Feature Map Convergence Evaluation Network (FMCE-Net) to measure and predict the convergence degree of models, respectively. FMCE-Net achieves remarkable predictive accuracy for FMCS across multiple image classification experiments, validating the efficacy and robustness of the introduced approach. To the best of our knowledge, this is the first independent evaluation method for functional modules, offering a new paradigm for the training assessment of perception models.
自动驾驶感知模型通常由多个功能模块组成,这些模块通过复杂的关系相互作用以完成环境理解。然而,感知模型大多通过端到端训练被当作黑箱进行优化,缺乏对功能模块的独立评估,这给可解释性和优化带来了困难。作为该问题上的开创性工作,我们提出了一种基于特征图分析的评估方法来衡量模型的收敛情况,从而评估功能模块的训练成熟度。我们构建了名为特征图收敛分数(FMCS)的定量指标,并开发了特征图收敛评估网络(FMCE-Net),分别用于度量和预测模型的收敛程度。在多个图像分类实验中,FMCE-Net对FMCS均取得了显著的预测精度,验证了所提方法的有效性和鲁棒性。据我们所知,这是首个针对功能模块的独立评估方法,为感知模型的训练评估提供了新的范式。
https://arxiv.org/abs/2405.04041
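The abstract above does not give the FMCS formula. One natural convergence proxy, shown below purely as an illustration of the feature-map-analysis idea (not the paper's metric), scores a module by how little its feature maps change between consecutive training checkpoints: small relative change maps to a score near 1 (converged).

```python
# Illustrative convergence proxy over feature-map snapshots (flattened lists).
# The normalization and the 1/(1+delta) squashing are assumptions.

def relative_change(prev, curr) -> float:
    """Total absolute change between two flattened feature maps, normalized
    by the magnitude of the earlier map."""
    num = sum(abs(a - b) for a, b in zip(prev, curr))
    den = sum(abs(a) for a in prev) or 1.0
    return num / den

def convergence_score(snapshots) -> float:
    """Squash the last inter-checkpoint relative change into (0, 1]."""
    delta = relative_change(snapshots[-2], snapshots[-1])
    return 1.0 / (1.0 + delta)

stable   = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]   # converged module
drifting = [[1.0, 2.0, 3.0], [2.0, 0.5, 4.0]]   # still-training module
print(convergence_score(stable), convergence_score(drifting))
```

A learned evaluator like FMCE-Net would predict such a score directly from the feature maps, without needing multiple checkpoints at inference time.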
Dormant pruning, or the removal of unproductive portions of a tree while a tree is not actively growing, is an important orchard task to help maintain yield, requiring years to build expertise. Because of long training periods and an increasing labor shortage in agricultural jobs, pruning could benefit from robotic automation. However, to program robots to prune branches, we first need to understand how pruning decisions are made, and what variables in the environment (e.g., branch size and thickness) we need to capture. Working directly with three pruning stakeholders -- horticulturists, growers, and pruners -- we find that each group of human experts approaches pruning decision-making differently. To capture this knowledge, we present three studies and two extracted pruning protocols from field work conducted in Prosser, Washington in January 2022 and 2023. We interviewed six stakeholders (two in each group) and observed pruning across three cultivars -- Bing Cherries, Envy Apples, and Jazz Apples -- and two tree architectures -- Upright Fruiting Offshoot and V-Trellis. Leveraging participant interviews and video data, this analysis uses grounded coding to extract pruning terminology, discover horticultural contexts that influence pruning decisions, and find implementable pruning heuristics for autonomous systems. The results include a validated terminology set, which we offer for use by both pruning stakeholders and roboticists, to communicate general pruning concepts and heuristics. The results also highlight seven pruning heuristics utilizing this terminology set that would be relevant for use by future autonomous robot pruning systems, and characterize three discovered horticultural contexts (i.e., environmental management, crop-load management, and replacement wood) across all three cultivars.
休眠期修剪,即在树木非生长期去除不具生产力的枝条,是维持产量的重要果园作业,需要多年积累专业知识。由于培训周期长且农业劳动力日益短缺,修剪工作有望受益于机器人自动化。然而,要让机器人学会修剪枝条,我们首先需要了解修剪决策是如何做出的,以及需要捕捉环境中的哪些变量(例如枝条的大小和粗细)。通过与三类修剪利益相关者(园艺学家、种植者和修剪工)直接合作,我们发现每一类人类专家的修剪决策方式各不相同。为了捕捉这些知识,我们基于2022年1月和2023年1月在华盛顿州Prosser开展的田间工作,呈现了三项研究和两套提炼出的修剪规程。我们采访了六位利益相关者(每类两人),并观察了三个栽培品种(Bing樱桃、Envy苹果和Jazz苹果)和两种树形(Upright Fruiting Offshoot与V形棚架)下的修剪过程。借助参与者访谈和视频数据,本分析使用扎根编码提取修剪术语,发现影响修剪决策的园艺情境,并找到可供自主系统实现的修剪启发式规则。结果包括一套经过验证的术语集,我们将其提供给修剪利益相关者和机器人研究者,用于交流一般性的修剪概念和启发式规则。结果还突出了七条使用该术语集、可供未来自主修剪机器人系统采用的修剪启发式规则,并刻画了在全部三个栽培品种中发现的三种园艺情境(即环境管理、负载管理和更新枝)。
https://arxiv.org/abs/2405.04030
Object detection plays a critical role in autonomous driving, where accurately and efficiently detecting objects in fast-moving scenes is crucial. Traditional frame-based cameras face challenges in balancing latency and bandwidth, necessitating innovative solutions. Event cameras have emerged as promising sensors for autonomous driving due to their low latency, high dynamic range, and low power consumption. However, effectively utilizing the asynchronous and sparse event data presents challenges, particularly in maintaining low latency and lightweight architectures for object detection. This paper provides an overview of object detection using event data in autonomous driving, showcasing the competitive benefits of event cameras.
物体检测在自动驾驶中扮演着关键角色,在快速移动的场景中准确且高效地检测物体至关重要。传统的基于帧的相机在平衡延迟和带宽方面面临挑战,因此需要创新的解决方案。事件相机凭借低延迟、高动态范围和低功耗,已成为自动驾驶中有前景的传感器。然而,有效利用异步且稀疏的事件数据仍存在挑战,尤其是在物体检测中保持低延迟和轻量级架构方面。本文综述了自动驾驶中基于事件数据的物体检测,展示了事件相机的竞争优势。
https://arxiv.org/abs/2405.03995
V2X cooperation, through the integration of sensor data from both vehicles and infrastructure, is considered a pivotal approach to advancing autonomous driving technology. Current research primarily focuses on enhancing perception accuracy, often overlooking the systematic improvement of accident prediction accuracy through end-to-end learning, leading to insufficient attention to the safety issues of autonomous driving. To address this challenge, this paper introduces the UniE2EV2X framework, a V2X-integrated end-to-end autonomous driving system that consolidates key driving modules within a unified network. The framework employs a deformable attention-based data fusion strategy, effectively facilitating cooperation between vehicles and infrastructure. The main advantages include: 1) significantly enhancing agents' perception and motion prediction capabilities, thereby improving the accuracy of accident predictions; 2) ensuring high reliability in the data fusion process; 3) superior end-to-end perception compared to modular approaches. Furthermore, we implement the UniE2EV2X framework on the challenging DeepAccident, a simulation dataset designed for V2X cooperative driving.
V2X协作通过整合来自车辆和基础设施的传感器数据,被认为是推动自动驾驶技术发展的关键途径。当前研究主要集中于提升感知精度,往往忽视了通过端到端学习系统性地提升事故预测精度,导致对自动驾驶安全问题关注不足。为应对这一挑战,本文提出了UniE2EV2X框架,一种V2X集成的端到端自动驾驶系统,将关键驾驶模块整合在统一的网络中。该框架采用基于可变形注意力的数据融合策略,有效促进了车辆与基础设施之间的协作。主要优点包括:1)显著增强智能体的感知和运动预测能力,从而提高事故预测的准确性;2)确保数据融合过程的高可靠性;3)相比模块化方法具有更优的端到端感知能力。此外,我们在具有挑战性的DeepAccident(一个为V2X协同驾驶设计的仿真数据集)上实现了UniE2EV2X框架。
https://arxiv.org/abs/2405.03971
Recently, we have been witnessing the remarkable progress and widespread adoption of sensing technologies in autonomous driving, robotics, and the metaverse. Considering the rapid advancement of computer vision (CV) technology to analyze the sensing information, we anticipate a proliferation of wireless applications exploiting the sensing and CV technologies in 6G. In this article, we provide a holistic overview of the sensing and CV-aided wireless communications (SVWC) framework for 6G. By analyzing the high-resolution sensing information through powerful CV techniques, SVWC can quickly and accurately understand the wireless environments and then perform the wireless tasks. To demonstrate the efficacy of SVWC, we design the whole process of SVWC, including the sensing dataset collection, deep learning (DL) model training, and execution of realistic wireless tasks. From the numerical evaluations on 6G communication scenarios, we show that SVWC achieves considerable performance gains over the conventional 5G systems in terms of positioning accuracy, data rate, and access latency.
近年来,我们见证了传感技术在自动驾驶、机器人和元宇宙中的显著进步与广泛应用。考虑到用于分析传感信息的计算机视觉(CV)技术的快速发展,我们预计6G中将涌现大量利用传感与CV技术的无线应用。在本文中,我们全面概述了面向6G的传感与CV辅助无线通信(SVWC)框架。通过利用强大的CV技术分析高分辨率传感信息,SVWC能够快速、准确地理解无线环境,进而执行无线任务。为了证明SVWC的有效性,我们设计了SVWC的完整流程,包括传感数据集收集、深度学习模型训练以及真实无线任务的执行。对6G通信场景的数值评估表明,SVWC在定位精度、数据速率和接入时延方面均显著优于传统5G系统。
https://arxiv.org/abs/2405.03945
In the wake of rapid advancements in artificial intelligence (AI), we stand on the brink of a transformative leap in data systems. The imminent fusion of AI and DB (AIxDB) promises a new generation of data systems, which will relieve the burden on end-users across all industry sectors by featuring AI-enhanced functionalities, such as personalized and automated in-database AI-powered analytics, self-driving capabilities for improved system performance, etc. In this paper, we explore the evolution of data systems with a focus on deepening the fusion of AI and DB. We present NeurDB, our next-generation data system designed to fully embrace AI design in each major system component and provide in-database AI-powered analytics. We outline the conceptual and architectural overview of NeurDB, discuss its design choices and key components, and report its current development and future plan.
在人工智能(AI)迅速发展的背景下,我们正站在数据系统变革性飞跃的边缘。AI与数据库(AIxDB)的融合有望催生新一代数据系统,通过提供AI增强的功能(例如个性化、自动化的库内AI分析,以及提升系统性能的自治能力等),减轻各行业最终用户的负担。在本文中,我们探讨了数据系统的演进,重点关注AI与数据库融合的深化。我们介绍了NeurDB,我们的下一代数据系统,其每个主要系统组件都充分拥抱AI设计,并提供库内AI分析。我们概述了NeurDB的概念与架构,讨论了其设计选择和关键组件,并报告了其当前开发进展与未来计划。
https://arxiv.org/abs/2405.03924
3D object detection plays an important role in autonomous driving; however, its vulnerability to backdoor attacks has become evident. By injecting ''triggers'' to poison the training dataset, backdoor attacks manipulate the detector's prediction for inputs containing these triggers. Existing backdoor attacks against 3D object detection primarily poison 3D LiDAR signals, where large-sized 3D triggers are injected to ensure their visibility within the sparse 3D space, rendering them easy to detect and impractical in real-world scenarios. In this paper, we delve into the robustness of 3D object detection, exploring a new backdoor attack surface through 2D cameras. Given the prevalent adoption of camera and LiDAR signal fusion for high-fidelity 3D perception, we investigate the latent potential of camera signals to disrupt the process. Although the dense nature of camera signals enables the use of nearly imperceptible small-sized triggers to mislead 2D object detection, realizing 2D-oriented backdoor attacks against 3D object detection is non-trivial. The primary challenge emerges from the fusion process that transforms camera signals into a 3D space, compromising the association with the 2D trigger to the target output. To tackle this issue, we propose an innovative 2D-oriented backdoor attack against LiDAR-camera fusion methods for 3D object detection, named BadFusion, for preserving trigger effectiveness throughout the entire fusion process. The evaluation demonstrates the effectiveness of BadFusion, achieving a significantly higher attack success rate compared to existing 2D-oriented attacks.
3D物体检测在自动驾驶中扮演着重要角色;然而,它对后门攻击的脆弱性已日益显现。通过向训练数据集中注入“触发器”,后门攻击可以操纵检测器对包含这些触发器的输入的预测。现有针对3D物体检测的后门攻击主要污染3D激光雷达信号,需要注入大尺寸的3D触发器以确保其在稀疏3D空间中的可见性,这使其容易被发现,在现实场景中并不实用。在本文中,我们深入研究3D物体检测的鲁棒性,探索通过2D相机发起后门攻击的新攻击面。鉴于相机与激光雷达信号融合在高保真3D感知中的普遍应用,我们研究了相机信号干扰该过程的潜在可能。尽管相机信号的稠密特性使得几乎不可察觉的小尺寸触发器足以误导2D物体检测,但要实现面向2D的、针对3D物体检测的后门攻击并非易事。主要挑战来自将相机信号变换到3D空间的融合过程,它削弱了2D触发器与目标输出之间的关联。为解决这一问题,我们提出了一种新颖的、针对激光雷达-相机融合3D物体检测方法的2D后门攻击,名为BadFusion,可在整个融合过程中保持触发器的有效性。评估结果表明BadFusion十分有效,其攻击成功率显著高于现有的面向2D的攻击。
https://arxiv.org/abs/2405.03884
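The abstract above contrasts large, conspicuous 3D LiDAR triggers with small 2D camera triggers. The sketch below only quantifies that size contrast by stamping a hypothetical patch into a camera frame; it is not BadFusion's actual trigger design or its fusion-aware optimization.

```python
# Illustrative only: how little of a camera frame an 8x8 patch trigger covers.
# Patch size, position, and frame resolution are assumptions.

def stamp_trigger(image, patch, top, left):
    """Copy a small patch into a grayscale image (list of rows), in place."""
    for dr, row in enumerate(patch):
        for dc, v in enumerate(row):
            image[top + dr][left + dc] = v
    return image

H, W = 1080, 1920
frame = [[0.0] * W for _ in range(H)]
trigger = [[1.0] * 8 for _ in range(8)]          # an 8x8 pixel patch
stamp_trigger(frame, trigger, top=100, left=200)

coverage = (8 * 8) / (H * W)                     # fraction of pixels touched
print(f"trigger covers {coverage:.6%} of the frame")
```

Such a patch touches well under 0.01% of the pixels, which is why the dense camera modality admits near-imperceptible triggers while the sparse LiDAR modality does not.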
Accurate trajectory prediction is crucial for ensuring safe and efficient autonomous driving. However, most existing methods overlook complex interactions between traffic participants that often govern their future trajectories. In this paper, we propose SocialFormer, an agent interaction-aware trajectory prediction method that leverages the semantic relationship between the target vehicle and surrounding vehicles by making use of the road topology. We also introduce an edge-enhanced heterogeneous graph transformer (EHGT) as the aggregator in a graph neural network (GNN) to encode the semantic and spatial agent interaction information. Additionally, we introduce a temporal encoder based on gated recurrent units (GRU) to model the temporal social behavior of agent movements. Finally, we present an information fusion framework that integrates agent encoding, lane encoding, and agent interaction encoding for a holistic representation of the traffic scene. We evaluate SocialFormer for the trajectory prediction task on the popular nuScenes benchmark and achieve state-of-the-art performance.
准确的轨迹预测对于确保安全高效的自动驾驶至关重要。然而,大多数现有方法忽视了交通参与者之间往往决定其未来轨迹的复杂交互。在本文中,我们提出了SocialFormer,一种考虑智能体交互的轨迹预测方法,它借助道路拓扑结构来利用目标车辆与周围车辆之间的语义关系。我们还引入了边增强异质图Transformer(EHGT)作为图神经网络(GNN)中的聚合器,以编码智能体之间的语义和空间交互信息。此外,我们引入了基于门控循环单元(GRU)的时间编码器,对智能体运动的时序社交行为进行建模。最后,我们提出了一个信息融合框架,整合智能体编码、车道编码和智能体交互编码,以获得交通场景的整体表示。我们在流行的nuScenes基准上评估了SocialFormer的轨迹预测性能,取得了最先进的结果。
https://arxiv.org/abs/2405.03809
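The GRU temporal encoder mentioned in the abstract above can be sketched in miniature. This is a scalar toy with arbitrary placeholder weights, not SocialFormer's trained encoder; the real model operates on vector states inside a larger network.

```python
# Scalar GRU cell rolled over a sequence of agent displacements.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h: float, x: float, w: dict) -> float:
    """One GRU update (common convention: h' = (1 - z) * n + z * h)."""
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])           # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])           # reset gate
    n = math.tanh(w["wn"] * x + w["un"] * (r * h) + w["bn"])   # candidate state
    return (1.0 - z) * n + z * h

def encode_trajectory(xs, w) -> float:
    """Roll the GRU over a displacement sequence; the final hidden state is a
    fixed-size summary of the temporal behavior."""
    h = 0.0
    for x in xs:
        h = gru_step(h, x, w)
    return h

w = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
     "wr": 0.5, "ur": 0.1, "br": 0.0,
     "wn": 1.0, "un": 0.5, "bn": 0.0}
print(encode_trajectory([0.2, 0.3, 0.1, 0.4], w))
```

The gating is what lets such an encoder keep or discard history per step, which is why GRUs are a common choice for summarizing agent motion over time.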
This paper introduces UniGen, a novel approach to generating new traffic scenarios for evaluating and improving autonomous driving software through simulation. Our approach models all driving scenario elements in a unified model: the position of new agents, their initial state, and their future motion trajectories. By predicting the distributions of all these variables from a shared global scenario embedding, we ensure that the final generated scenario is fully conditioned on all available context in the existing scene. Our unified modeling approach, combined with autoregressive agent injection, conditions the placement and motion trajectory of every new agent on all existing agents and their trajectories, leading to realistic scenarios with low collision rates. Our experimental results show that UniGen outperforms prior state of the art on the Waymo Open Motion Dataset.
本文介绍了UniGen,一种通过仿真生成新交通场景、以评估和改进自动驾驶软件的新方法。我们的方法用一个统一的模型对所有驾驶场景元素建模:新智能体的位置、初始状态及其未来运动轨迹。通过从共享的全局场景嵌入中预测所有这些变量的分布,我们确保最终生成的场景完全以现有场景中的所有可用上下文为条件。我们的统一建模方法与自回归智能体注入相结合,使每个新智能体的位置和运动轨迹都以所有已有智能体及其轨迹为条件,从而生成碰撞率低的逼真场景。实验结果表明,UniGen在Waymo Open Motion Dataset上超越了此前的最先进水平。
https://arxiv.org/abs/2405.03807
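The autoregressive injection loop described above — each new agent conditioned on all agents already placed — can be caricatured with rejection sampling. This is a geometric stand-in for illustration only; UniGen conditions a learned distribution on a scenario embedding rather than sampling uniformly, and the spacing threshold below is an assumption.

```python
# Toy autoregressive agent injection: each candidate position is accepted
# only if it is consistent (here: collision-free) with all prior agents.
import random

def inject_agents(n, min_gap=5.0, bounds=(0.0, 100.0), seed=0):
    """Place n agents one at a time; each placement is conditioned on every
    previously placed agent via a minimum-separation check."""
    rng = random.Random(seed)
    placed = []
    while len(placed) < n:
        cand = (rng.uniform(*bounds), rng.uniform(*bounds))
        ok = all(((cand[0] - p[0]) ** 2 + (cand[1] - p[1]) ** 2) ** 0.5 >= min_gap
                 for p in placed)
        if ok:
            placed.append(cand)
    return placed

placed = inject_agents(6)
print(placed)
```

The point of the autoregressive order is visible even in this toy: because agent k sees agents 1..k-1, the joint scene is consistent by construction, which is what keeps collision rates low.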
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, as well as the robustness of these models to user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: this https URL.
大型语言模型(LLM)的最新进展推动了视频大型多模态模型(Video-LMM)的发展,这类模型能够处理广泛的视频理解任务。它们有望部署在机器人、AI助手、医学影像和自动驾驶车辆等现实应用中。Video-LMM在日常生活中的广泛应用,凸显了确保并评估其在复杂现实情境中模拟类人推理与交互能力的鲁棒表现的重要性。然而,现有的Video-LMM基准主要关注一般视频理解能力,忽视了评估其在现实情境复杂视频上的推理能力,以及从用户文本提示的角度评估这些模型的鲁棒性。在本文中,我们提出了复杂视频推理与鲁棒性评估套件(CVRR-ES),一个在11个不同现实视频维度上全面评估Video-LMM性能的新基准。我们评估了9个最新模型(包括开源和闭源变体),发现大多数Video-LMM(尤其是开源模型)在处理复杂视频时鲁棒性和推理能力不足。基于我们的分析,我们提出了一种免训练的双步上下文提示(DSCP)技术,以提升现有Video-LMM的性能。我们的发现为构建具备更强鲁棒性和推理能力的下一代以人为本的AI系统提供了宝贵的洞见。我们的数据集和代码已在原文链接处公开。
https://arxiv.org/abs/2405.03690
This paper introduces RoboCar, an open-source research platform for autonomous driving developed at the University of Luxembourg. RoboCar provides a modular, cost-effective framework for the development of experimental Autonomous Driving Systems (ADS), utilizing the 2018 KIA Soul EV. The platform integrates a robust hardware and software architecture that aligns with the vehicle's existing systems, minimizing the need for extensive modifications. It supports various autonomous driving functions and has undergone real-world testing on public roads in Luxembourg City. This paper outlines the platform's architecture, integration challenges, and initial test results, offering insights into its application in advancing autonomous driving research. RoboCar is available to anyone at this https URL and is released under an open-source MIT license.
本文介绍了RoboCar,一个由卢森堡大学开发的开源自动驾驶研究平台。RoboCar基于2018款KIA Soul EV,为实验性自动驾驶系统(ADS)的开发提供了模块化、高性价比的框架。该平台集成了与车辆现有系统相适配的稳健软硬件架构,最大限度减少了大规模改装的需求。它支持多种自动驾驶功能,并已在卢森堡市的公共道路上进行了实际测试。本文概述了该平台的架构、集成挑战和初步测试结果,为其在推进自动驾驶研究中的应用提供了洞察。RoboCar可通过原文链接获取,并以开源MIT许可证发布。
https://arxiv.org/abs/2405.03572
In this study, we introduce "SARDiM," a modular semi-autonomous platform enhanced with mixed reality for industrial disassembly tasks. Through a case study focused on EV battery disassembly, SARDiM integrates Mixed Reality, object segmentation, teleoperation, force feedback, and variable autonomy. Utilising the ROS, Unity, and MATLAB platforms, alongside a joint impedance controller, SARDiM facilitates teleoperated disassembly. The approach combines FastSAM for real-time object segmentation, generating data which is subsequently processed through a cluster analysis algorithm to determine the centroid and orientation of the components, categorizing them by size and disassembly priority. This data guides the MoveIt platform in trajectory planning for the Franka robot arm. SARDiM provides the capability to switch between two teleoperation modes: manual and semi-autonomous with variable autonomy. Each mode was evaluated using four different Interface Methods (IM): direct view, monitor feed, mixed reality with monitor feed, and point cloud mixed reality. Evaluations across the eight mode-IM combinations demonstrated a 40.61% decrease in joint limit violations using Mode 2. Moreover, Mode 2-IM4 outperformed Mode 1-IM1 by achieving a 2.33% time reduction while considerably increasing safety, making it optimal for operating in hazardous environments at a safe distance, with the same ease of use as teleoperation with a direct view of the environment.
在这项研究中,我们提出了“SARDiM”,一个借助混合现实增强、面向工业拆解任务的模块化半自主平台。通过一个聚焦电动汽车电池拆解的案例研究,SARDiM整合了混合现实、物体分割、遥操作、力反馈和可变自主。SARDiM利用ROS、Unity和MATLAB平台以及关节阻抗控制器,实现遥操作拆解。该方法结合FastSAM进行实时物体分割,生成的数据随后经聚类分析算法处理,以确定部件的质心和朝向,并按尺寸和拆解优先级进行分类。这些数据用于引导MoveIt平台为Franka机械臂进行轨迹规划。SARDiM支持在两种遥操作模式之间切换:手动模式和具有可变自主度的半自主模式。每种模式均通过四种不同的界面方法(IM)进行评估:直接观察、显示器画面、结合显示器画面的混合现实,以及点云混合现实。对八种模式-界面组合的评估表明,模式2使关节限位违规减少了40.61%。此外,模式2-IM4相比模式1-IM1节省了2.33%的时间,同时显著提升了安全性,使其成为在安全距离下于危险环境中作业的最优选择,且易用性与直接观察环境的遥操作相当。
https://arxiv.org/abs/2405.03530
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, which exhibit an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. Finally, we examine the challenges and limitations of world models, and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: this https URL.
通用世界模型是迈向通用人工智能(AGI)的关键途径,是从虚拟环境到决策系统等各类应用的基石。最近,Sora模型因其出色的模拟能力而备受关注,展现出对物理定律的初步理解。在本综述中,我们全面考察了世界模型的最新进展。我们的分析贯穿视频生成领域的前沿生成方法,其中世界模型是促成高度逼真视觉内容合成的关键构件。此外,我们审视了蓬勃发展的自动驾驶世界模型领域,细致描绘了它们在重塑交通与城市出行中不可或缺的作用。我们还深入探讨了部署于自主智能体中的世界模型的内在复杂性,揭示了它们在动态环境情境中实现智能交互的深远意义。最后,我们考察了世界模型的挑战与局限,并讨论了其潜在的未来方向。我们希望本综述能为研究社区提供基础性参考,并激发持续创新。本综述将定期更新于原文链接。
https://arxiv.org/abs/2405.03520
Effective execution of long-horizon tasks with dexterous robotic hands remains a significant challenge in real-world problems. While learning from human demonstrations has shown encouraging results, it requires extensive data collection for training. Hence, decomposing long-horizon tasks into reusable primitive skills is a more efficient approach. To this end, we developed DexSkills, a novel supervised learning framework that addresses long-horizon dexterous manipulation tasks using primitive skills. DexSkills is trained to recognize and replicate a select set of skills using human demonstration data, which can then segment a demonstrated long-horizon dexterous manipulation task into a sequence of primitive skills to achieve one-shot execution by the robot directly. Significantly, DexSkills operates solely on proprioceptive and tactile data, i.e., haptic data. Our real-world robotic experiments show that DexSkills can accurately segment skills, thereby enabling autonomous robot execution of a diverse range of tasks.
在现实问题中,用灵巧机械手有效执行长时程任务仍是一项重大挑战。虽然从人类演示中学习已经取得了令人鼓舞的结果,但这需要大量的数据收集用于训练。因此,将长时程任务分解为可复用的基元技能是更高效的途径。为此,我们开发了DexSkills,一种利用基元技能解决长时程灵巧操作任务的新型监督学习框架。DexSkills通过人类演示数据学习识别并复现一组选定的技能,进而可将演示的长时程灵巧操作任务切分为一系列基元技能,使机器人能够直接一次性执行。值得注意的是,DexSkills仅依赖本体感受和触觉数据,即触感数据。真实机器人实验表明,DexSkills能够准确地切分技能,从而使机器人能够自主执行各类任务。
https://arxiv.org/abs/2405.03476
In complex traffic environments, autonomous vehicles face multi-modal uncertainty about other agents' future behavior. To address this, recent advancements in learning-based motion predictors output multi-modal predictions. We present our novel framework that leverages Branch Model Predictive Control (BMPC) to account for these predictions. The framework includes an online scenario-selection process guided by topology and collision risk criteria. This efficiently selects a minimal set of predictions, rendering the BMPC real-time capable. Additionally, we introduce an adaptive decision postponing strategy that delays the planner's commitment to a single scenario until the uncertainty is resolved. Our comprehensive evaluations in traffic intersection and random highway merging scenarios demonstrate enhanced comfort and safety through our method.
在复杂的交通环境中,自动驾驶车辆面临其他交通参与者未来行为的多模态不确定性。为此,近期基于学习的运动预测器能够输出多模态预测。我们提出了一个利用分支模型预测控制(BMPC)来处理这些预测的新框架。该框架包括一个由拓扑和碰撞风险准则引导的在线场景选择过程,它高效地选出一个最小的预测集合,使BMPC具备实时能力。此外,我们引入了一种自适应的决策推迟策略,在不确定性消解之前,延迟规划器对单一场景的承诺。我们在交叉路口和随机高速公路汇入场景中的全面评估表明,我们的方法提升了舒适性和安全性。
https://arxiv.org/abs/2405.03470
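The scenario-selection idea described above can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the field names, the one-representative-per-topology rule, and the risk ranking are assumptions about how topology and collision-risk criteria might prune predictions to a minimal set.

```python
# Toy online scenario selection: keep at most one representative prediction
# per topology class, ranked by collision risk, so the branch MPC tracks a
# small scenario tree and remains real-time capable.

def select_scenarios(predictions, max_k=2):
    """predictions: dicts with 'topology' (class label) and 'risk' in [0, 1]."""
    best = {}
    for p in predictions:                      # one representative per topology
        t = p["topology"]
        if t not in best or p["risk"] > best[t]["risk"]:
            best[t] = p
    # highest-risk scenarios first, truncated to bound the MPC branch count
    return sorted(best.values(), key=lambda p: p["risk"], reverse=True)[:max_k]

preds = [
    {"topology": "yield", "risk": 0.7},
    {"topology": "yield", "risk": 0.4},   # dominated by the first "yield" mode
    {"topology": "cross", "risk": 0.9},
    {"topology": "stop",  "risk": 0.1},
]
print(select_scenarios(preds))
```

Grouping by topology first matters: two predictions in the same homotopy class call for essentially the same contingency plan, so planning for both wastes the MPC's real-time budget.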
In this study, we present an implementation strategy for a robot that performs peg transfer tasks in Fundamentals of Laparoscopic Surgery (FLS) via imitation learning, aimed at the development of an autonomous robot for laparoscopic surgery. Robotic laparoscopic surgery presents two main challenges: (1) the need to manipulate forceps using ports established on the body surface as fulcrums, and (2) difficulty in perceiving depth information when working with a monocular camera that displays its images on a monitor. Especially, regarding issue (2), most prior research has assumed the availability of depth images or models of a target to be operated on. Therefore, in this study, we achieve more accurate imitation learning with only monocular images by extracting motion constraints from one exemplary motion of skilled operators, collecting data based on these constraints, and conducting imitation learning based on the collected data. We implemented an overall system using two Franka Emika Panda Robot Arms and validated its effectiveness.
在这项研究中,我们提出了一种通过模仿学习让机器人执行腹腔镜基本功(FLS)中钉块转移任务的实现策略,旨在开发用于腹腔镜手术的自主机器人。机器人腹腔镜手术面临两大挑战:(1)需要以体表建立的穿刺孔为支点来操纵钳子;(2)使用在显示器上呈现图像的单目相机时,难以感知深度信息。特别是针对问题(2),以往研究大多假定可以获得被操作目标的深度图像或模型。因此,在本研究中,我们从熟练操作者的一次示范动作中提取运动约束,基于这些约束收集数据,并据此进行模仿学习,从而仅凭单目图像实现了更准确的模仿学习。我们使用两台Franka Emika Panda机械臂实现了整个系统,并验证了其有效性。
https://arxiv.org/abs/2405.03440
Building accurate maps is a key building block to enable reliable localization, planning, and navigation of autonomous vehicles. We propose a novel approach for building accurate maps of dynamic environments utilizing a sequence of LiDAR scans. To this end, we propose encoding the 4D scene into a novel spatio-temporal implicit neural map representation by fitting a time-dependent truncated signed distance function to each point. Using our representation, we extract the static map by filtering the dynamic parts. Our neural representation is based on sparse feature grids, a globally shared decoder, and time-dependent basis functions, which we jointly optimize in an unsupervised fashion. To learn this representation from a sequence of LiDAR scans, we design a simple yet efficient loss function to supervise the map optimization in a piecewise way. We evaluate our approach on various scenes containing moving objects in terms of the reconstruction quality of static maps and the segmentation of dynamic point clouds. The experimental results demonstrate that our method is capable of removing the dynamic part of the input point clouds while reconstructing accurate and complete 3D maps, outperforming several state-of-the-art methods. Codes are available at: this https URL
构建准确的地图是实现自动驾驶车辆可靠定位、规划与导航的关键基础。我们提出了一种利用激光雷达扫描序列为动态环境构建准确地图的新方法。为此,我们通过为每个点拟合随时间变化的截断符号距离函数,将4D场景编码为一种新颖的时空隐式神经地图表示。利用该表示,我们通过滤除动态部分来提取静态地图。我们的神经表示基于稀疏特征网格、全局共享的解码器以及随时间变化的基函数,并以无监督的方式对其联合优化。为了从激光雷达扫描序列中学习该表示,我们设计了一个简单而高效的损失函数,以分段方式监督地图优化。我们在多个包含运动物体的场景中,从静态地图重建质量和动态点云分割两方面评估了我们的方法。实验结果表明,我们的方法能够在去除输入点云中动态部分的同时重建准确而完整的3D地图,优于多种最先进的方法。代码已在原文链接处开源。
https://arxiv.org/abs/2405.03388
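The filtering step described above — a time-dependent truncated signed distance function (TSDF) per point, with dynamic parts removed to leave the static map — can be caricatured as follows. This is an illustrative stand-in, not the paper's implicit neural representation: it assumes per-point TSDF time series are already available and flags a point as dynamic when its distance value varies over time beyond a threshold.

```python
# Toy static/dynamic split from per-point TSDF time series.
# The variance threshold is an arbitrary assumption.

def split_static_dynamic(sdf_series, var_thresh=0.01):
    """sdf_series: {point_id: [d(t0), d(t1), ...]} of per-scan TSDF values.
    Points whose distance stays constant over time are static."""
    static, dynamic = [], []
    for pid, series in sdf_series.items():
        mean = sum(series) / len(series)
        var = sum((d - mean) ** 2 for d in series) / len(series)
        (dynamic if var > var_thresh else static).append(pid)
    return static, dynamic

scene = {
    "wall":       [0.02, 0.02, 0.02],   # distance stays put  -> static
    "parked_car": [0.05, 0.05, 0.06],   # nearly constant     -> static
    "moving_car": [0.00, 0.50, 1.00],   # distance drifts     -> dynamic
}
static, dynamic = split_static_dynamic(scene)
print(static, dynamic)
```

The appeal of a time-dependent distance field is exactly this: stationary structure yields a (near-)constant function of time, so dynamics show up as temporal variation rather than requiring explicit object detection.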
Purpose: Autonomous navigation of devices in endovascular interventions can decrease operation times, improve decision-making during surgery, and reduce operator radiation exposure while increasing access to treatment. This systematic review explores recent literature to assess the impact, challenges, and opportunities artificial intelligence (AI) has for autonomous endovascular intervention navigation. Methods: PubMed and IEEE Xplore databases were queried. Eligibility criteria included studies investigating the use of AI in enabling the autonomous navigation of catheters/guidewires in endovascular interventions. Following PRISMA, articles were assessed using QUADAS-2. PROSPERO: CRD42023392259. Results: Among 462 studies, fourteen met inclusion criteria. Reinforcement learning (9/14, 64%) and learning from demonstration (7/14, 50%) were used as data-driven models for autonomous navigation. Studies predominantly utilised physical phantoms (10/14, 71%) and in silico (4/14, 29%) models. Experiments within or around the blood vessels of the heart were reported by the majority of studies (10/14, 71%), while simple non-anatomical vessel platforms were used in three studies (3/14, 21%), and the porcine liver venous system in one study. We observed that risk of bias and poor generalisability were present across studies. No procedures were performed on patients in any of the studies reviewed. Studies lacked patient selection criteria, reference standards, and reproducibility, resulting in low clinical evidence levels. Conclusions: AI's potential in autonomous endovascular navigation is promising, but in an experimental proof-of-concept stage, with a technology readiness level of 3. We highlight that reference standards with well-identified performance metrics are crucial to allow for comparisons of data-driven algorithms proposed in the years to come.
目的:血管内介入治疗中器械的自主导航可以缩短手术时间、改善术中决策、减少操作者的辐射暴露,同时提高治疗的可及性。本系统综述考察了近期文献,以评估人工智能(AI)对血管内介入自主导航的影响、挑战与机遇。方法:检索PubMed和IEEE Xplore数据库。纳入标准为研究利用AI实现血管内介入中导管/导丝自主导航的文献。遵循PRISMA流程,并使用QUADAS-2对文章进行评估。PROSPERO注册号:CRD42023392259。结果:在462项研究中,14项符合纳入标准。强化学习(9/14,64%)和演示学习(7/14,50%)被用作自主导航的数据驱动模型。研究主要使用物理体模(10/14,71%)和计算机仿真(4/14,29%)模型。大多数研究(10/14,71%)报告了在心脏血管内或其周围进行的实验,三项研究(3/14,21%)使用了简单的非解剖血管平台,一项研究使用了猪肝静脉系统。我们观察到各研究普遍存在偏倚风险和泛化性不足的问题。所有纳入研究均未在患者身上实施手术。这些研究缺乏患者选择标准、参考标准和可重复性,导致临床证据等级较低。结论:AI在血管内自主导航方面的潜力令人期待,但仍处于实验性概念验证阶段,技术成熟度等级为3。我们强调,具有明确性能指标的参考标准,对于比较未来几年提出的数据驱动算法至关重要。
https://arxiv.org/abs/2405.03305