3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. In particular, diffusion-based approaches bridge 3D scene synthesis and photorealism by reframing generation as an image or video synthesis problem. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: this https URL.
https://arxiv.org/abs/2505.05474
We present DSDrive, a streamlined end-to-end paradigm tailored for integrating the reasoning and planning of autonomous vehicles into a unified framework. DSDrive leverages a compact LLM that employs a distillation method to preserve the enhanced reasoning capabilities of a larger-sized vision language model (VLM). To effectively align the reasoning and planning tasks, a waypoint-driven dual-head coordination module is further developed, which synchronizes dataset structures, optimization objectives, and the learning process. By integrating these tasks into a unified framework, DSDrive anchors on the planning results while incorporating detailed reasoning insights, thereby enhancing the interpretability and reliability of the end-to-end pipeline. DSDrive has been thoroughly tested in closed-loop simulations, where it performs on par with benchmark models and even outperforms them on many key metrics, all while being more compact in size. Additionally, the computational efficiency of DSDrive (as reflected in its time and memory requirements during inference) has been significantly enhanced. This work thus underscores the potential of lightweight systems in delivering interpretable and efficient solutions for autonomous driving (AD).
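As a rough illustration of the dual-head idea, the sketch below puts a token-prediction (reasoning) head and a waypoint-regression head on shared features. The dimensions, pooling, and head designs are assumptions for illustration, not DSDrive's actual architecture.

```python
# Hedged sketch: two heads on shared LLM features, one for reasoning tokens, one for waypoints.
# Hidden size, vocabulary size, horizon, and the mean pooling are illustrative assumptions.
import torch
import torch.nn as nn

class DualHead(nn.Module):
    def __init__(self, hidden_dim=512, vocab_size=32000, horizon=8):
        super().__init__()
        self.reasoning_head = nn.Linear(hidden_dim, vocab_size)   # next-token logits
        self.waypoint_head = nn.Sequential(                       # (x, y) per future step
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, horizon * 2)
        )
        self.horizon = horizon

    def forward(self, shared_features: torch.Tensor):
        """shared_features: (batch, seq_len, hidden_dim) from the compact, distilled LLM."""
        pooled = shared_features.mean(dim=1)                       # crude pooling for the sketch
        token_logits = self.reasoning_head(shared_features)
        waypoints = self.waypoint_head(pooled).view(-1, self.horizon, 2)
        return token_logits, waypoints

# Example forward pass on random features.
feats = torch.randn(2, 16, 512)
logits, wps = DualHead()(feats)
print(logits.shape, wps.shape)   # torch.Size([2, 16, 32000]) torch.Size([2, 8, 2])
```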
https://arxiv.org/abs/2505.05360
In this study, an autonomous visual-guided robotic cotton-picking system, built on a Clearpath's Husky robot platform and the Cotton-Eye perception system, was developed in the Gazebo robotic simulator. Furthermore, a virtual cotton farm was designed and developed as a Robot Operating System (ROS 1) package to deploy the robotic cotton picker in the Gazebo environment for simulating autonomous field navigation. The navigation was assisted by the map coordinates and an RGB-depth camera, while the ROS navigation algorithm utilized a trained YOLOv8n-seg model for instance segmentation. The model achieved a desired mean Average Precision (mAP) of 85.2%, a recall of 88.9%, and a precision of 93.0% for scene segmentation. The developed ROS navigation packages enabled our robotic cotton-picking system to autonomously navigate through the cotton field using map-based and GPS-based approaches, visually aided by a deep learning-based perception system. The GPS-based navigation approach achieved a 100% completion rate (CR) with a threshold of 5 x 10^-6 degrees, while the map-based navigation approach attained a 96.7% CR with a threshold of 0.25 m. This study establishes a fundamental baseline of simulation for future agricultural robotics and autonomous vehicles in cotton farming and beyond. CottonSim code and data are released to the research community via GitHub: this https URL
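A minimal sketch of the completion-rate (CR) metric as read from the abstract: the share of goal points the robot reaches within a positional threshold (degrees for GPS goals, meters for map goals). The exact definition used in CottonSim may differ.

```python
# Hedged sketch of a completion-rate computation; coordinates and thresholds are illustrative.
import numpy as np

def completion_rate(reached_positions, goal_positions, threshold):
    """Both inputs are (N, 2) arrays; threshold is in the same units as the coordinates."""
    errors = np.linalg.norm(np.asarray(reached_positions) - np.asarray(goal_positions), axis=1)
    return float((errors <= threshold).mean())

# Map-frame example using the 0.25 m threshold quoted in the abstract.
goals = np.array([[0.0, 0.0], [5.0, 0.0], [5.0, 5.0]])
reached = goals + np.array([[0.05, 0.02], [0.10, -0.20], [0.40, 0.00]])
print(completion_rate(reached, goals, threshold=0.25))   # -> 0.666...
```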
https://arxiv.org/abs/2505.05317
In order to mitigate economical, ecological, and societal challenges in electric scooter (e-scooter) sharing systems, we develop an autonomous e-scooter prototype. Our vision is to design a fully autonomous prototype that can find its way to the next parking spot, high-demand area, or charging station. In this work, we propose a path following solution to enable localization and navigation in an urban environment with a provided path to follow. We design a closed-loop architecture that solves the localization and path following problem while allowing the e-scooter to maintain its balance with a previously developed reaction wheel mechanism. Our approach facilitates state and input constraints, e.g., adhering to the path width, while remaining executable on a Raspberry Pi 5. We demonstrate the efficacy of our approach in a real-world experiment on our prototype.
https://arxiv.org/abs/2505.05314
In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon a Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoregressively performs scene understanding, danger level estimation, and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on the Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours of video with high-quality annotations to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.
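One way to picture the closed-loop step: build a prompt from the personalized instruction and the current scene, and parse a danger level and an action from the model's reply. The schema, parser, and stubbed MLLM below are assumptions for illustration, not PADriver's actual interface.

```python
# Hedged sketch of one decision step with a stubbed MLLM; the "danger/action" text schema,
# the threshold, and the fallback rule are assumptions, not the paper's implementation.
import re

PERSONALIZED_PROMPT = "Drive conservatively; avoid lane changes unless the risk is low."

def parse_mllm_output(text: str):
    """Extract a danger level (integer) and a discrete action from the model's text response."""
    danger = int(re.search(r"danger\s*[:=]\s*(\d+)", text, re.I).group(1))
    action = re.search(r"action\s*[:=]\s*(\w+)", text, re.I).group(1).upper()
    return danger, action

def decide(frame_description: str, mllm=lambda p: "danger: 3, action: keep_lane"):
    prompt = f"{PERSONALIZED_PROMPT}\nScene: {frame_description}\nRespond with danger and action."
    danger, action = parse_mllm_output(mllm(prompt))
    # The danger estimate gates the action: high estimated risk falls back to a safe default.
    return "SLOW_DOWN" if danger >= 7 else action

print(decide("two cars ahead, right lane clear"))   # -> KEEP_LANE
```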
https://arxiv.org/abs/2505.05240
Human drivers exhibit individual preferences regarding driving style. Adapting autonomous vehicles to these preferences is essential for user trust and satisfaction. However, existing end-to-end driving approaches often rely on predefined driving styles or require continuous user feedback for adaptation, limiting their ability to support dynamic, context-dependent preferences. We propose a novel approach using multi-objective reinforcement learning (MORL) with preference-driven optimization for end-to-end autonomous driving that enables runtime adaptation to driving style preferences. Preferences are encoded as continuous weight vectors to modulate behavior along interpretable style objectives (efficiency, comfort, speed, and aggressiveness) without requiring policy retraining. Our single-policy agent integrates vision-based perception in complex mixed-traffic scenarios and is evaluated in diverse urban environments using the CARLA simulator. Experimental results demonstrate that the agent dynamically adapts its driving behavior according to changing preferences while maintaining performance in terms of collision avoidance and route completion.
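As a rough illustration of the preference mechanism, the sketch below scalarizes per-objective rewards with a runtime-adjustable weight vector. The objective names, example rewards, and normalization are assumptions for illustration rather than the paper's exact reward design.

```python
# Hedged sketch: preference-weighted reward scalarization for MORL.
# Objective names, rewards, and weights below are illustrative assumptions.
import numpy as np

STYLE_OBJECTIVES = ["efficiency", "comfort", "speed", "aggressiveness"]

def scalarize_reward(objective_rewards: dict, preference_weights: np.ndarray) -> float:
    """Combine per-objective rewards into one scalar using a preference vector.

    The preference vector can change at runtime without retraining the policy,
    which is the core idea of preference-driven MORL described in the abstract.
    """
    r = np.array([objective_rewards[name] for name in STYLE_OBJECTIVES])
    w = preference_weights / preference_weights.sum()  # keep weights on the simplex
    return float(np.dot(w, r))

# Example: a comfort-first user versus a sporty user, given the same underlying rewards.
step_rewards = {"efficiency": 0.4, "comfort": 0.9, "speed": 0.2, "aggressiveness": -0.3}
comfort_user = np.array([0.2, 0.6, 0.1, 0.1])
sporty_user = np.array([0.1, 0.1, 0.4, 0.4])
print(scalarize_reward(step_rewards, comfort_user))
print(scalarize_reward(step_rewards, sporty_user))
```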
https://arxiv.org/abs/2505.05223
Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves up to 49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.
https://arxiv.org/abs/2505.05212
The safety of autonomous cars has come under scrutiny in recent years, especially after 16 documented incidents involving Teslas (with autopilot engaged) crashing into parked emergency vehicles (police cars, ambulances, and firetrucks). While previous studies have revealed that strong light sources often introduce flare artifacts in the captured image, which degrade the image quality, the impact of flare on object detection performance remains unclear. In this research, we unveil PaniCar, a digital phenomenon that causes an object detector's confidence score to fluctuate below detection thresholds when exposed to activated emergency vehicle lighting. This vulnerability poses a significant safety risk, and can cause autonomous vehicles to fail to detect objects near emergency vehicles. In addition, this vulnerability could be exploited by adversaries to compromise the security of advanced driving assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3, "manufacturer C", HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors (YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle lighting to understand the influence of various technical and environmental factors. We also evaluate four SOTA flare removal methods and show that their performance and latency are insufficient for real-time driving constraints. To mitigate this risk, we propose Caracetamol, a robust framework designed to enhance the resilience of object detectors against the effects of activated emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster R-CNN, Caracetamol improves the models' average confidence of car detection by 0.20, the lower confidence bound by 0.33, and reduces the fluctuation range by 0.33. In addition, Caracetamol is capable of processing frames at a rate of 30 to 50 FPS, enabling real-time ADAS car detection.
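The abstract quotes three stability numbers: average confidence, lower confidence bound, and fluctuation range. Below is a minimal sketch of one plausible way to compute them from per-frame detector scores; the authors' exact definitions may differ.

```python
# Hedged sketch of the three stability metrics over per-frame detection confidences.
# The oscillating score trace below merely simulates flashing emergency lights.
import numpy as np

def confidence_stability(scores: np.ndarray) -> dict:
    """scores: per-frame confidence of a tracked car under activated emergency lighting."""
    return {
        "average_confidence": float(scores.mean()),
        "lower_confidence_bound": float(scores.min()),
        "fluctuation_range": float(scores.max() - scores.min()),
    }

# Example: a detector whose confidence oscillates with the light-flashing period.
frames = np.clip(0.55 + 0.35 * np.sin(np.linspace(0, 12 * np.pi, 240)), 0.0, 1.0)
print(confidence_stability(frames))
```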
https://arxiv.org/abs/2505.05183
This work presents an online velocity planner for autonomous racing that adapts to changing dynamic constraints, such as grip variations from tire temperature changes and rubber accumulation. The method combines a forward-backward solver for online velocity optimization with a novel spatial sampling strategy for local trajectory planning, utilizing a three-dimensional track representation. The computed velocity profile serves as a reference for the local planner, ensuring adaptability to environmental and vehicle dynamics. We demonstrate the approach's robust performance and computational efficiency in racing scenarios and discuss its limitations, including sensitivity to deviations from the predefined racing line and high jerk characteristics of the velocity profile.
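A forward-backward velocity pass is a standard construction; the sketch below shows the general idea under assumed curvature-based speed limits and constant acceleration bounds, not the paper's specific three-dimensional track model or solver.

```python
# Hedged sketch of a forward-backward velocity pass over a sampled path.
# Lateral-acceleration limit, longitudinal bounds, and path geometry are illustrative placeholders.
import numpy as np

def forward_backward_velocity(ds, v_max, a_acc, a_dec, v_start=0.0, v_end=0.0):
    """ds: segment lengths [m] (len N-1); v_max: pointwise speed limits [m/s] (len N).
    Returns a velocity profile respecting v_max and the longitudinal acceleration limits."""
    v = np.asarray(v_max, dtype=float).copy()
    v[0] = min(v[0], v_start)
    # Forward pass: limit acceleration.
    for i in range(len(v) - 1):
        v[i + 1] = min(v[i + 1], np.sqrt(v[i] ** 2 + 2.0 * a_acc * ds[i]))
    v[-1] = min(v[-1], v_end)
    # Backward pass: limit deceleration.
    for i in range(len(v) - 2, -1, -1):
        v[i] = min(v[i], np.sqrt(v[i + 1] ** 2 + 2.0 * a_dec * ds[i]))
    return v

# Example: speed limits derived from curvature via v <= sqrt(a_lat_max / |kappa|), a_lat_max = 8 m/s^2 (assumed).
kappa = np.array([0.001, 0.01, 0.05, 0.05, 0.01, 0.001])
v_limit = np.sqrt(8.0 / np.maximum(np.abs(kappa), 1e-6))
profile = forward_backward_velocity(ds=np.full(5, 10.0), v_max=v_limit, a_acc=4.0, a_dec=6.0)
print(profile.round(1))
```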
https://arxiv.org/abs/2505.05157
End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance than conventional pipelines in both open-loop and closed-loop settings. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a unified multi-modal large language model (MLLM) framework designed for closed-loop autonomous driving, leveraging Chain-of-Thought (CoT) and autoregressive modeling to enhance perception and decision-making. We validate X-Driver across multiple autonomous driving tasks using public benchmarks in the CARLA simulation environment, including Bench2Drive [6]. Our experimental results demonstrate superior closed-loop performance, surpassing the current state-of-the-art (SOTA) while improving the interpretability of driving decisions. These findings underscore the importance of structured reasoning in end-to-end driving and establish X-Driver as a strong baseline for future research in closed-loop autonomous driving.
https://arxiv.org/abs/2505.05098
Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: this https URL
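A hedged sketch of the kind of corruption-robustness sweep such a benchmark runs, using mean absolute disparity error (often called end-point error, EPE, for disparity) and stub stand-ins for the model, the corruption functions, and the data; DispBench's actual interfaces will differ.

```python
# Hedged sketch of a robustness sweep over corruption types and severities.
# The identity-like "model", the noise "corruption", and the random stereo pairs are toy stand-ins.
import numpy as np

def epe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute disparity error over all pixels."""
    return float(np.abs(pred - gt).mean())

def evaluate(model, pairs, corruptions, severities=(1, 3, 5)):
    results = {}
    for name, corrupt in corruptions.items():
        for sev in severities:
            errs = [epe(model(corrupt(l, sev), corrupt(r, sev)), gt) for l, r, gt in pairs]
            results[(name, sev)] = float(np.mean(errs))
    return results

rng = np.random.default_rng(0)
pairs = [(rng.random((4, 4)), rng.random((4, 4)), rng.random((4, 4))) for _ in range(2)]
corruptions = {"gaussian_noise": lambda img, s: img + 0.02 * s * rng.standard_normal(img.shape)}
print(evaluate(lambda left, right: (left + right) / 2, pairs, corruptions))
```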
https://arxiv.org/abs/2505.05091
We present a vehicle system capable of navigating safely and efficiently around Vulnerable Road Users (VRUs), such as pedestrians and cyclists. The system comprises key modules for environment perception, localization and mapping, motion planning, and control, integrated into a prototype vehicle. A key innovation is a motion planner based on Topology-driven Model Predictive Control (T-MPC). The guidance layer generates multiple trajectories in parallel, each representing a distinct strategy for obstacle avoidance or non-passing. The underlying trajectory optimization constrains the joint probability of collision with VRUs under generic uncertainties. To address extraordinary situations ("edge cases") that go beyond the autonomous capabilities - such as construction zones or encounters with emergency responders - the system includes an option for remote human operation, supported by visual and haptic guidance. In simulation, our motion planner outperforms three baseline approaches in terms of safety and efficiency. We also demonstrate the full system in prototype vehicle tests on a closed track, both in autonomous and remotely operated modes.
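As a simplified illustration of constraining the joint probability of collision, the sketch below bounds the probability of at least one collision event given per-step, per-VRU probabilities and checks it against a risk budget. The independence assumption and the numbers are illustrative, not the paper's chance-constraint formulation.

```python
# Hedged sketch: joint collision probability over many events, checked against a risk budget.
import numpy as np

def joint_collision_probability(per_event_probs: np.ndarray) -> float:
    """Probability that at least one collision event occurs, assuming independent events."""
    return float(1.0 - np.prod(1.0 - per_event_probs))

def trajectory_admissible(per_event_probs, epsilon=0.01) -> bool:
    """A candidate trajectory is kept only if its joint collision risk stays below epsilon."""
    return joint_collision_probability(per_event_probs) <= epsilon

probs = np.array([1e-3, 5e-4, 2e-3])   # per-step, per-VRU collision probabilities (assumed)
print(joint_collision_probability(probs), trajectory_admissible(probs))
```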
https://arxiv.org/abs/2505.04982
This paper proposes a novel Large Vision-Language Model (LVLM) and Model Predictive Control (MPC) integration framework that delivers both task scalability and safety for Autonomous Driving (AD). LVLMs excel at high-level task planning across diverse driving scenarios. However, since these foundation models are not specifically designed for driving and their reasoning is not consistent with the feasibility of low-level motion planning, concerns remain regarding safety and smooth task switching. This paper integrates LVLMs with MPC Builder, which automatically generates MPCs on demand, based on symbolic task commands generated by the LVLM, while ensuring optimality and safety. The generated MPCs can strongly assist the execution or rejection of LVLM-driven task switching by providing feedback on the feasibility of the given tasks and generating task-switching-aware MPCs. Our approach provides a safe, flexible, and adaptable control framework, bridging the gap between cutting-edge foundation models and reliable vehicle operation. We demonstrate the effectiveness of our approach through a simulation experiment, showing that our system can safely and effectively handle highway driving while maintaining the flexibility and adaptability of LVLMs.
https://arxiv.org/abs/2505.04980
The miniaturisation of sensors and processors, the advancements in connected edge intelligence, and the exponential interest in Artificial Intelligence are boosting the affirmation of autonomous nano-size drones in the Internet of Robotic Things ecosystem. However, achieving safe autonomous navigation and high-level tasks such as exploration and surveillance with these tiny platforms is extremely challenging due to their limited resources. This work focuses on enabling the safe and autonomous flight of a pocket-size, 30-gram platform called Crazyflie 2.1 in a partially known environment. We propose a novel AI-aided, vision-based reactive planning method for obstacle avoidance under the ambit of the Integrated Sensing, Computing and Communication paradigm. We deal with the constraints of the nano-drone by splitting the navigation task into two parts: a deep learning-based object detector runs on the edge (external hardware) while the planning algorithm is executed onboard. The results show the ability to command the drone at ~8 frames per second and a model performance reaching a COCO mean average precision of 60.8. Field experiments demonstrate the feasibility of the solution with the drone flying at a top speed of 1 m/s while steering away from an obstacle placed in an unknown position and reaching the target destination. The outcome highlights the compatibility of the communication delay and the model performance with the requirements of the real-time navigation task. We provide a feasible alternative to a fully onboard implementation that can be extended to autonomous exploration with nano-drones.
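A minimal sketch of the onboard side of such a split pipeline: compact detections arrive from the edge detector and a tiny reactive rule turns them into a velocity command. The message fields, thresholds, and steering rule are illustrative assumptions, not the paper's planner.

```python
# Hedged sketch of a reactive onboard rule fed by off-board detections.
from dataclasses import dataclass

@dataclass
class Detection:
    x_center: float   # normalized [0, 1] across the image width
    width: float      # normalized box width, used as a crude proximity proxy

def onboard_reactive_command(dets, forward_speed=1.0):
    """Return (vx, yaw_rate): slow down and steer away from the closest detected obstacle."""
    if not dets:
        return forward_speed, 0.0
    closest = max(dets, key=lambda d: d.width)
    if closest.width < 0.15:                              # far away: keep cruising
        return forward_speed, 0.0
    yaw_rate = 0.8 if closest.x_center < 0.5 else -0.8    # steer away from the obstacle side
    return forward_speed * (1.0 - closest.width), yaw_rate

print(onboard_reactive_command([Detection(x_center=0.42, width=0.3)]))
```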
https://arxiv.org/abs/2505.04972
Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.
https://arxiv.org/abs/2505.04965
The proposed system outlined in this paper is a solution to a use case that requires the autonomous picking of cuboidal objects from an organized or unorganized pile with high precision. This paper presents an efficient method for precise pose estimation of cuboid-shaped objects, which aims to reduce errors in the target pose in a time-efficient manner. Typical pose estimation methods like global point cloud registration are prone to minor pose errors, for which local registration algorithms are generally used to improve pose accuracy. However, due to the execution-time overhead and the uncertainty in the error of the final achieved pose, an alternate, linear-time approach is proposed for pose error estimation and correction. This paper presents an overview of the solution followed by a detailed description of the individual modules of the proposed algorithm.
https://arxiv.org/abs/2505.04962
Despite the impressive achievements of AI, including advancements in generative models and large language models, there remains a significant gap in the ability of AI to handle uncertainty and generalize beyond the training data. We argue that AI models, especially in autonomous systems, fail to make robust predictions when faced with unfamiliar or adversarial data, as evidenced by incidents with autonomous vehicles. Traditional machine learning approaches struggle to address these issues due to an overemphasis on data fitting and domain adaptation. This position paper posits a paradigm shift towards epistemic artificial intelligence, emphasizing the need for models to learn not only from what they know but also from their ignorance. This approach, which focuses on recognizing and managing uncertainty, offers a potential solution to improve the resilience and robustness of AI systems, ensuring that they can better handle unpredictable real-world environments.
https://arxiv.org/abs/2505.04950
Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single-agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL-trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them on multi-agent scenarios or interacting with other ACD agents. In this paper, we show the first study on how LLMs perform in multi-agent ACD environments by proposing a new integration to the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.
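A hedged sketch of what a structured inter-agent message might look like for mixed LLM/RL defender teams; the field names and JSON encoding are assumptions, not the communication protocol proposed in the paper or the CybORG CAGE 4 interface.

```python
# Hedged sketch of an inter-agent defense message and its wire encoding.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DefenseMessage:
    sender: str                                   # e.g. "llm_agent_1" or "rl_agent_2"
    observed_hosts: list = field(default_factory=list)
    suspected_compromise: list = field(default_factory=list)
    intended_action: str = "monitor"
    rationale: str = ""                           # LLM agents can attach an explanation; RL agents may leave it empty

def encode(msg: DefenseMessage) -> str:
    return json.dumps(asdict(msg))

msg = DefenseMessage(sender="llm_agent_1",
                     suspected_compromise=["host_7"],
                     intended_action="restore host_7",
                     rationale="Repeated failed logins followed by outbound beaconing.")
print(encode(msg))
```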
https://arxiv.org/abs/2505.04843
Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence.
Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models
https://arxiv.org/abs/2505.04769
Lane determination and lane sequence determination are important components for many Connected and Automated Vehicle (CAV) applications. Lane determination has been solved using Hidden Markov Model (HMM) among other methods. The existing HMM literature for lane sequence determination uses empirical definitions with user-modified parameters to calculate HMM probabilities. The probability definitions in the literature can cause breaks in the HMM due to the inability to directly calculate probabilities of off-road positions, requiring post-processing of data. This paper develops a time-varying HMM using the physical properties of the roadway and vehicle, and the stochastic properties of the sensors. This approach yields emission and transition probability models conditioned on the sensor data without parameter tuning. It also accounts for the probability that the vehicle is not in any roadway lane (e.g., on the shoulder or making a U-turn), which eliminates the need for post-processing to deal with breaks in the HMM processing. This approach requires adapting the Viterbi algorithm and the HMM to be conditioned on the sensor data, which are then used to generate the most-likely sequence of lanes the vehicle has traveled. The proposed approach achieves an average accuracy of 95.9%. Compared to the existing literature, this provides an average increase of 2.25% by implementing the proposed transition probability and an average increase of 5.1% by implementing both the proposed transition and emission probabilities.
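For readers unfamiliar with the decoding step, below is a minimal sketch of a Viterbi pass over lane states plus an explicit off-road state, with time-varying emission and transition matrices as the abstract describes. The random placeholder probabilities stand in for the paper's sensor- and geometry-derived models.

```python
# Hedged sketch: Viterbi decoding over lane states plus an "off-road" state with
# per-step (time-varying) emission and transition matrices. Probabilities are placeholders.
import numpy as np

def viterbi_time_varying(log_emission, log_transition, log_prior):
    """log_emission: (T, S); log_transition: (T-1, S, S); log_prior: (S,).
    Returns the most likely state sequence (lane indices; last index = off-road)."""
    T, S = log_emission.shape
    delta = log_prior + log_emission[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_transition[t - 1]      # (S, S): previous -> next state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emission[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Example: 2 lanes + 1 off-road state over 4 time steps, with arbitrary (assumed) probabilities.
S, T = 3, 4
rng = np.random.default_rng(0)
em = np.log(rng.dirichlet(np.ones(S), size=T))
tr = np.log(rng.dirichlet(np.ones(S), size=(T - 1, S)))
print(viterbi_time_varying(em, tr, np.log(np.full(S, 1.0 / S))))
```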
https://arxiv.org/abs/2505.04763