Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One set turns the LLM into a Decision Expert for scenario reasoning and driving decision-making, while the other turns it into an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
https://arxiv.org/abs/2512.13636
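The abstract's key idea, trial-and-error learning over a finite set of discrete linguistic decisions with trajectory-level rewards fed back, can be sketched as a simple value-learning loop. The decision set, the stand-in trajectory simulator, and all hyperparameters below are illustrative assumptions, not MindDrive's actual components:

```python
import random

# Finite, discrete linguistic decision space (hypothetical labels).
DECISIONS = ["keep_lane", "yield", "overtake", "stop"]

def simulate_trajectory(decision):
    # Stand-in for the Action Expert plus simulator: maps a linguistic
    # decision to a noisy trajectory-level reward (values are invented).
    base = {"keep_lane": 0.6, "yield": 0.8, "overtake": 0.3, "stop": 0.1}
    return base[decision] + random.uniform(-0.05, 0.05)

def train(episodes=2000, eps=0.1, lr=0.1, seed=0):
    random.seed(seed)
    q = {d: 0.0 for d in DECISIONS}          # value per discrete decision
    for _ in range(episodes):
        if random.random() < eps:            # exploration stays cheap:
            d = random.choice(DECISIONS)     # only 4 arms to try
        else:
            d = max(q, key=q.get)
        r = simulate_trajectory(d)           # trajectory-level reward
        q[d] += lr * (r - q[d])              # incremental value update
    return q

q = train()
print(max(q, key=q.get))  # the decision with the highest learned value
```

The point of the sketch is the exploration argument: a handful of discrete linguistic decisions can be exhaustively tried and ranked, whereas exploring a continuous trajectory space directly would require far more samples.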
The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework uses reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
https://arxiv.org/abs/2512.13573
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
https://arxiv.org/abs/2512.13399
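The bilevel structure can be sketched as follows, under assumed components: a meta-optimizer mixes atomic reward primitives into a weighted Meta-Reward and estimates a "meta-gradient" of inner-loop validation performance from antithetic evolution-strategy samples. The primitive set and the proxy validation function are hypothetical stand-ins, not DERL's actual inner loop:

```python
import math
import random

PRIMITIVES = ("progress", "penalty", "sparse")  # assumed atomic primitives

def compose_weights(prefs):
    # Softmax over meta-parameters gives the primitive mixing weights.
    z = sum(math.exp(v) for v in prefs.values())
    return {k: math.exp(v) / z for k, v in prefs.items()}

def inner_loop_validation(weights):
    # Hypothetical proxy for training a policy and measuring validation
    # performance: it peaks at one particular balance of the primitives.
    target = {"progress": 0.6, "penalty": 0.3, "sparse": 0.1}
    return 1.0 - sum(abs(weights[k] - target[k]) for k in weights) / 2.0

def meta_optimize(steps=300, lr=0.5, sigma=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {k: 0.0 for k in PRIMITIVES}      # meta-parameters
    for _ in range(steps):
        noise = {k: rng.gauss(0.0, sigma) for k in PRIMITIVES}
        plus = {k: prefs[k] + noise[k] for k in PRIMITIVES}
        minus = {k: prefs[k] - noise[k] for k in PRIMITIVES}
        # Validation performance is the only meta-signal (no true gradient):
        delta = (inner_loop_validation(compose_weights(plus))
                 - inner_loop_validation(compose_weights(minus)))
        for k in PRIMITIVES:                  # antithetic ES gradient estimate
            prefs[k] += lr * delta * noise[k] / (2.0 * sigma ** 2)
    return compose_weights(prefs)

weights = meta_optimize()
# The learned mixture should now emphasize the primitive the validation
# signal rewards most (here, "progress").
```

The design choice being illustrated: treating validation performance as a scalar return turns reward-function search into a smooth(able) optimization problem instead of blind black-box evolution.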
Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Test-time scaling based on Warm-K enhances behavioral consistency and reactivity at test time without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
https://arxiv.org/abs/2512.13262
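The warm-start mechanism can be sketched in a few lines, under simplifying assumptions: motions are scalars in [0, 1) and the scoring model is a stand-in (the paper's sampler operates on full motion trajectories). Because the previous selection always re-enters the candidate set, the selected score can never regress between steps, which is the consistency half of the trade-off:

```python
import random

def score(candidate, target=0.7):
    # Hypothetical scorer: closer to a target behavior is better.
    return -abs(candidate - target)

def warm_k_select(prev_best, k=5, seed=0):
    candidates = [] if prev_best is None else [prev_best]  # warm start
    rng = random.Random(seed)
    while len(candidates) < k:
        candidates.append(rng.random())       # fresh, diverse samples
    return max(candidates, key=score)         # previous pick can never be
                                              # displaced by a worse sample

choice = None
for step in range(10):                        # closed-loop rollout
    choice = warm_k_select(choice, seed=step)
```

The K-1 fresh samples supply the diversity half: exploration continues every cycle, but only strictly better motions replace the warm candidate.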
Vision-language models enable understanding and reasoning over complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the 2D image understanding paradigm, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
https://arxiv.org/abs/2512.13177
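The described behavior of the Text-oriented Multimodal Modulator, weighting each modality by semantic cues in the question, can be approximated with a dot-product gating sketch. The toy vectors and the softmax gate are assumptions for illustration, not the paper's architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modulate(question_vec, modality_feats):
    # modality_feats: e.g. {"occupancy": vec, "lidar": vec, "text": vec}.
    # Each modality's contribution is gated by its affinity (dot product)
    # with the question embedding, then features are fused by weighted sum.
    names = list(modality_feats)
    weights = softmax([dot(question_vec, modality_feats[n]) for n in names])
    fused = [0.0] * len(question_vec)
    for w, n in zip(weights, names):
        for i, v in enumerate(modality_feats[n]):
            fused[i] += w * v                 # context-aware integration
    return dict(zip(names, weights)), fused
```

For a geometry-oriented question whose embedding aligns with the LiDAR features, this gate would up-weight the LiDAR branch and down-weight the others.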
Conversational agents often encounter ambiguous user requests, requiring effective clarification to complete tasks successfully. While recent advancements in real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge, particularly because it is difficult to determine which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query in a multi-agent setting remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. Then, we present MAC, which autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases the task success rate by 7.8 points (from 54.5 to 62.3) and reduces the average number of dialogue turns (from 6.53 to 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human-agent communication.
https://arxiv.org/abs/2512.13154
Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR, which restores images degraded by weather effects, and Point Cloud Restoration (PCR), which compensates for corrupted LiDAR data using image object cues. To tackle misalignment between the two modalities, we develop the Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.
https://arxiv.org/abs/2512.13107
Imitation learning (IL) has emerged as a central paradigm in autonomous driving. While IL excels at matching expert behavior in open-loop settings by minimizing per-step prediction errors, its performance degrades unexpectedly in closed loop due to the gradual accumulation of small, often imperceptible errors over time. Over successive planning cycles, these errors compound, potentially resulting in severe failures. Current research efforts predominantly rely on increasingly sophisticated network architectures or high-fidelity training datasets to enhance the robustness of IL planners against error accumulation, focusing on state-level robustness at a single time point. However, autonomous driving is inherently a continuous-time process, and leveraging the temporal scale to enhance robustness may provide a new perspective for addressing this issue. To this end, we propose a method termed Sequence of Experts (SoE), a temporal alternation policy that enhances closed-loop performance without increasing model size or data requirements. Our experiments on the large-scale autonomous driving benchmark nuPlan demonstrate that the SoE method consistently and significantly improves the performance of all evaluated models and achieves state-of-the-art results. This module may provide key and widely applicable support for improving the training efficiency of autonomous driving models.
https://arxiv.org/abs/2512.13094
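A toy sketch of a temporal alternation policy: expert heads take turns across planning cycles. The two biased experts below are invented for illustration; in the paper the experts are learned planners, and the claim is about closed-loop robustness, not exact cancellation:

```python
def make_soe(experts):
    def policy(state, t):
        expert = experts[t % len(experts)]    # alternate over time
        return expert(state)
    return policy

experts = [lambda s: s + 0.1,                 # head with a +0.1 step bias
           lambda s: s - 0.1]                 # head with a -0.1 step bias
policy = make_soe(experts)

state = 0.0
for t in range(10):                           # closed-loop rollout
    state = policy(state, t)
print(state)                                  # prints 0.0: the two biases
                                              # alternate and cancel
```

A single biased expert would have drifted by 1.0 over the same rollout; alternation keeps the systematic per-expert errors from compounding in one direction.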
Object tracking is an important step in robotics and autonomous driving pipelines and must generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. Moreover, such reference models can struggle with visually complex appearance, reducing tracking quality. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at this https URL.
https://arxiv.org/abs/2512.13007
This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
https://arxiv.org/abs/2512.12987
Despite their remarkable performance, deep neural networks exhibit a critical vulnerability: small, often imperceptible, adversarial perturbations can lead to drastically altered model predictions. Given the stringent reliability demands of applications such as medical diagnosis and autonomous driving, robust detection of such adversarial attacks is paramount. In this paper, we investigate the geometric properties of a model's input loss landscape. We analyze the Intrinsic Dimensionality (ID) of the model's gradient parameters, which quantifies the minimal number of coordinates required to describe the data points on their underlying manifold. We reveal a distinct and consistent difference in the ID for natural and adversarial data, which forms the basis of our proposed detection method. We validate our approach across two distinct operational scenarios. First, in a batch-wise context for identifying malicious data groups, our method demonstrates high efficacy on datasets like MNIST and SVHN. Second, in the critical individual-sample setting, we establish new state-of-the-art results on challenging benchmarks such as CIFAR-10 and MS COCO. Our detector significantly surpasses existing methods against a wide array of attacks, including CW and AutoAttack, achieving detection rates consistently above 92% on CIFAR-10. The results underscore the robustness of our geometric approach, highlighting that intrinsic dimensionality is a powerful fingerprint for adversarial detection across diverse datasets and attack strategies.
https://arxiv.org/abs/2512.12827
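As a rough illustration of why intrinsic dimensionality can separate two data populations, the sketch below uses the Levina-Bickel MLE estimator (a standard ID estimator, not necessarily the paper's exact choice) on synthetic point sets; the detector analogue would compare such estimates for gradients of natural versus adversarial inputs against a calibrated threshold:

```python
import math
import random

def knn_dists(points, i, k):
    # Distances from point i to its k nearest neighbors, ascending.
    d = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return d[:k]

def intrinsic_dim(points, k=10):
    # Levina-Bickel maximum-likelihood ID estimate, averaged over points.
    est = []
    for i in range(len(points)):
        d = knn_dists(points, i, k)
        logs = [math.log(d[k - 1] / d[j]) for j in range(k - 1)]
        est.append((k - 1) / sum(logs))       # per-point MLE estimate
    return sum(est) / len(est)

random.seed(0)
# 1-D manifold embedded in 3-D vs. points filling a 3-D cube.
line = [(t, 2 * t, -t) for t in (random.random() for _ in range(200))]
cube = [tuple(random.random() for _ in range(3)) for _ in range(200)]
print(intrinsic_dim(line), intrinsic_dim(cube))
# The 1-D manifold yields a much lower estimate than the 3-D cube.
```

The detection logic rests on exactly this gap: if adversarial inputs push gradient statistics off the natural-data manifold, their ID estimate shifts consistently and becomes a usable fingerprint.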
The transition of Large Language Models (LLMs) from passive code generators to autonomous agents introduces significant safety risks, specifically regarding destructive commands and inconsistent system states. Existing commercial solutions often prioritize interactive user safety, enforcing authentication barriers that break the headless loops required for true autonomy. This paper presents a Fault-Tolerant Sandboxing framework designed to mitigate these risks through a policy-based interception layer and a transactional filesystem snapshot mechanism. We hypothesize that wrapping agent actions in atomic transactions can guarantee safety with acceptable latency, outperforming the heavy initialization overhead of containers or the interactive friction of commercial CLIs. We validated this approach by deploying the Minimind-MoE LLM served via nano-vllm on a custom Proxmox-based testbed utilizing EVPN/VXLAN isolation. Experimental results demonstrate a 100% interception rate for high-risk commands and a 100% success rate in rolling back failed states. Crucially, our prototype incurs only a 14.5% performance overhead (approx. 1.8s) per transaction. In contrast, benchmarking against the Gemini CLI sandbox revealed that it requires interactive authentication ("Sign in"), rendering it unusable for headless, autonomous agent workflows.
https://arxiv.org/abs/2512.12806
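The two mechanisms the abstract describes, a policy-based interception layer and a transactional snapshot with rollback, can be sketched in a few lines. The deny-list patterns and the copy-based snapshot are simplifications assumed here for illustration; the paper's framework presumably uses richer policies and filesystem-level snapshots:

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical policy: patterns for high-risk commands to intercept.
DENY = [r"\brm\s+-rf\b", r"\bmkfs\b", r"\bdd\s+if="]

def run_sandboxed(cmd, workdir):
    # Interception layer: refuse policy-violating commands outright.
    if any(re.search(p, cmd) for p in DENY):
        return "denied", None
    # Transactional snapshot: copy the workspace before execution.
    snapshot = Path(tempfile.mkdtemp()) / "snap"
    shutil.copytree(workdir, snapshot)
    proc = subprocess.run(cmd, shell=True, cwd=workdir,
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # Roll back: restore the pre-execution workspace state.
        shutil.rmtree(workdir)
        shutil.copytree(snapshot, workdir)
        return "rolled_back", proc.returncode
    return "committed", proc.returncode
```

A failing command leaves the workspace exactly as it was before execution, which is the transactional guarantee whose latency cost the paper measures.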
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at this https URL
https://arxiv.org/abs/2512.12799
Recent advances in agentic AI have shifted the focus from standalone Large Language Models (LLMs) to integrated systems that combine LLMs with tools, memory, and other agents to perform complex tasks. These multi-agent architectures enable coordinated reasoning, planning, and execution across diverse domains, allowing agents to collaboratively automate complex workflows. Despite these advances, evaluation of LLM agents and the multi-agent systems they constitute remains a fundamental challenge. Although various approaches have been proposed in the software engineering literature for evaluating conventional software components, existing methods for AI-based systems often overlook the non-deterministic nature of models. This non-determinism introduces behavioral uncertainty during execution, yet existing evaluations rely on binary task completion metrics that fail to capture it. Evaluating agentic systems therefore requires examining additional dimensions, including an agent's ability to invoke tools, ingest and retrieve memory, collaborate with other agents, and interact effectively with its environment. We propose an end-to-end Agent Assessment Framework with four evaluation pillars encompassing LLMs, Memory, Tools, and Environment. We validate the framework on a representative Autonomous CloudOps use case, where experiments reveal behavioral deviations overlooked by conventional metrics, demonstrating its effectiveness in capturing runtime uncertainties.
https://arxiv.org/abs/2512.12791
With the recent development and integration of autonomous vehicles (AVs) into modern transportation systems, the emphasis on customizing user interfaces to optimize the overall user experience has grown rapidly. Understanding user needs and preferences is therefore essential to the acceptance of, and trust in, these technologies as they continue to grow in prevalence. This paper addresses the application of HCI principles to interface personalization to improve safety, security, and usability for users. It explores how personalized interfaces can be designed to increase user engagement and satisfaction through HCI strategies such as adaptive design, multi-modal interaction, and user feedback mechanisms. The paper also emphasizes transparency and user control in interface design: allowing users to shape or modify their experience can foster trust in autonomous systems. In doing so, this research highlights the influential role HCI will play in the future of autonomous vehicles, designing for the diverse needs of users while maintaining high standards of safety and security.
https://arxiv.org/abs/2512.12773
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
https://arxiv.org/abs/2512.12730
LLM-based agents often operate in a greedy, step-by-step manner, selecting actions solely based on the current observation without considering long-term consequences or alternative paths. This lack of foresight is particularly problematic in web environments, which are only partially observable (limited to browser-visible content such as the DOM and UI elements), and where a single misstep often requires complex and brittle navigation to undo. Without an explicit backtracking mechanism, agents struggle to correct errors or systematically explore alternative paths. Tree-search methods provide a principled framework for such structured exploration, but existing approaches lack mechanisms for safe backtracking, making them prone to unintended side effects. They also assume that all actions are reversible, ignoring the presence of irreversible actions; both limitations reduce their effectiveness in realistic web tasks. To address these challenges, we introduce WebOperator, a tree-search framework that enables reliable backtracking and strategic exploration. Our method incorporates a best-first search strategy that ranks actions by both reward estimates and safety considerations, along with a robust backtracking mechanism that verifies the feasibility of previously visited paths before replaying them, preventing unintended side effects. To further guide exploration, WebOperator generates action candidates from multiple, varied reasoning contexts to ensure diverse and robust exploration, and subsequently curates a high-quality action set by filtering out invalid actions before execution and merging semantically equivalent ones. Experimental results on WebArena and WebVoyager demonstrate the effectiveness of WebOperator. On WebArena, WebOperator achieves a state-of-the-art 54.6% success rate with gpt-4o, underscoring the critical advantage of integrating strategic foresight with safe execution.
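The core idea of ranking actions by both reward estimates and safety can be sketched as a priority queue that penalizes irreversible actions. This toy sketch is an assumption for illustration only (the names, scores, and penalty are invented, not WebOperator's actual API):

```python
import heapq

def select_order(candidates, irreversible_penalty=1.0):
    """Rank candidate actions by estimated reward minus a safety penalty.

    Each candidate is (name, reward_estimate, is_irreversible). Irreversible
    actions (e.g. form submissions, deletions) are penalized so that
    reversible paths are explored first, in the spirit of the paper's
    safety-aware best-first search.
    """
    heap = []
    for name, reward, irreversible in candidates:
        score = reward - (irreversible_penalty if irreversible else 0.0)
        heapq.heappush(heap, (-score, name))  # max-heap via negation
    return [name for _, name in (heapq.heappop(heap) for _ in range(len(heap)))]

actions = [
    ("click_search", 0.8, False),
    ("submit_form", 0.9, True),    # high reward but irreversible
    ("open_filter", 0.5, False),
]
print(select_order(actions))  # → ['click_search', 'open_filter', 'submit_form']
```

Note how the irreversible `submit_form` drops to last despite the highest raw reward: the search keeps cheap-to-undo branches ahead of ones that cannot be safely backtracked.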
https://arxiv.org/abs/2512.12692
This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.
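The rectified flow mechanism mentioned above connects noise and data along straight-line paths. A minimal NumPy sketch of the interpolation and its constant velocity target (a generic rectified-flow construction, not this paper's implementation):

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * x1 and its constant
    velocity target v = x1 - x0: the straight-line path a rectified-flow
    model is trained to regress at sampled times t."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8))   # x0 ~ N(0, I)
data = rng.standard_normal((4, 8))    # stand-in for real samples
xt, v = rectified_flow_pair(noise, data, 0.25)

# At t=0 the path starts at the noise; at t=1 it reaches the data.
assert np.allclose(rectified_flow_pair(noise, data, 0.0)[0], noise)
assert np.allclose(rectified_flow_pair(noise, data, 1.0)[0], data)
```

Because the target velocity is constant along each path, sampling can in principle use far fewer integration steps than curved diffusion trajectories, which is consistent with the efficiency claims in the abstract.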
https://arxiv.org/abs/2512.12595
Vertical beam dropout in spinning LiDAR sensors, triggered by hardware aging, dust, snow, fog, or bright reflections, removes entire vertical slices from the point cloud and severely degrades 3D perception in autonomous vehicles. This paper proposes a Graph Attention Network (GAT)-based framework that reconstructs these missing vertical channels using only the current LiDAR frame, with no camera images or temporal information required. Each LiDAR sweep is represented as an unstructured spatial graph: points are nodes, and edges connect nearby points while preserving the original beam-index ordering. A multi-layer GAT learns adaptive attention weights over local geometric neighborhoods and directly regresses the missing elevation (z) values at dropout locations. Trained and evaluated on 1,065 raw KITTI sequences with simulated channel dropout, the method achieves an average height RMSE of 11.67 cm, with 87.98% of reconstructed points falling within a 10 cm error threshold. Inference takes 14.65 seconds per frame on a single GPU, and reconstruction quality remains stable across different neighborhood sizes k. These results show that a pure graph attention model operating solely on raw point-cloud geometry can effectively recover dropped vertical beams under realistic sensor degradation.
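The core operation, attention-weighted pooling over a local neighborhood to regress a missing height, can be illustrated with a toy, untrained single attention step (far simpler than the paper's learned multi-layer GAT; all names and the distance-based scoring are assumptions for illustration):

```python
import numpy as np

def attention_z_estimate(query_xy, neighbors, temperature=1.0):
    """Toy attention-style aggregation: a softmax over negative planar
    distances weights each neighbor's height, imitating in spirit how a
    GAT layer pools local geometry to regress a missing z value.

    neighbors: array of shape (k, 3) with columns (x, y, z).
    """
    d = np.linalg.norm(neighbors[:, :2] - query_xy, axis=1)
    logits = -d / temperature
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()
    return float(w @ neighbors[:, 2])   # attention-weighted height

# Neighbors lying on the plane z = 0.1 * x: the estimate for a query
# at x = 2 should land back on that plane (≈ 0.2).
nbrs = np.array([[1.0, 0.0, 0.1],
                 [2.0, 0.0, 0.2],
                 [3.0, 0.0, 0.3]])
z_hat = attention_z_estimate(np.array([2.0, 0.0]), nbrs)
```

In the actual framework the attention weights are learned from node features rather than fixed by distance, which is what lets the network adapt to curbs, walls, and other non-planar geometry.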
https://arxiv.org/abs/2512.12410
Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
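The observe-think-act-memorize loop over a hierarchical memory can be sketched schematically. Every class, tool, and memory level below is an illustrative assumption (the abstract does not specify VideoARM's interfaces):

```python
# Schematic sketch of an agentic observe-think-act-memorize loop with a
# hierarchical multimodal memory, in the coarse-to-fine spirit of VideoARM.
class HierarchicalMemory:
    def __init__(self):
        # Coarse (whole-video) down to fine (per-clip) clue levels.
        self.levels = {"video": [], "segment": [], "clip": []}

    def write(self, level, clue):
        self.levels[level].append(clue)

    def context(self):
        # Surface coarse clues first, fine-grained clues last.
        return [c for lvl in ("video", "segment", "clip")
                for c in self.levels[lvl]]

def run_agent(question, tools, max_steps=4):
    memory = HierarchicalMemory()
    answer = None
    for _ in range(max_steps):
        obs = tools["observe"](memory.context())             # observe
        level, clue, answer = tools["think"](question, obs)  # think / act
        memory.write(level, clue)                            # memorize
        if answer is not None:                               # stop early
            break
    return answer, memory

# Minimal stub tools: a coarse pass first, then a fine pass that answers.
def observe(ctx):
    return len(ctx)

def think(question, obs):
    if obs == 0:
        return "video", "global summary", None
    return "clip", "fine detail", "the cat jumps"

ans, mem = run_agent("what happens?", {"observe": observe, "think": think})
```

The early-exit loop is what keeps token consumption adaptive: the controller stops as soon as the accumulated clues suffice, instead of exhaustively preprocessing the whole video up front.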
https://arxiv.org/abs/2512.12360