Scaling laws have achieved success in LLMs and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset are publicly available at: this https URL.
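As a rough illustration of what a CSI-based UAV 3D localization baseline of this kind might look like, the sketch below regresses a position from complex channel coefficients. It is a toy under stated assumptions, not the Great-MSD baseline: the antenna/subcarrier shapes, network width, and coordinate convention are all invented for illustration.

```python
# Hypothetical CSI-to-position regressor; shapes and sizes are assumptions.
import torch
import torch.nn as nn

class CSILocalizer(nn.Module):
    def __init__(self, n_subcarriers: int = 64, n_antennas: int = 8):
        super().__init__()
        d_in = 2 * n_subcarriers * n_antennas  # real + imaginary parts
        self.net = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3))                 # (x, y, z) position

    def forward(self, csi: torch.Tensor) -> torch.Tensor:
        # Split complex CSI into real/imaginary channels and flatten.
        x = torch.cat([csi.real, csi.imag], dim=-1).flatten(1)
        return self.net(x)

csi = torch.randn(4, 8, 64, dtype=torch.cfloat)  # (batch, antennas, subcarriers)
pred_xyz = CSILocalizer()(csi)
```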
https://arxiv.org/abs/2507.08716
Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches make the assumption that expert demonstrations are available, which is often not the case. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, thus highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which advances the state-of-the-art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation, and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state-of-the-art in reward learning from suboptimal demonstrations.
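SPLASH's reward learning builds on preference-based IRL. The snippet below sketches the standard Bradley-Terry preference loss that such methods typically optimize, where the preferred trajectory segment should accumulate higher predicted reward; this is the generic preference-learning core, not SPLASH's full hierarchical pipeline.

```python
# Generic Bradley-Terry preference loss over trajectory pairs (a sketch of
# the preference-based reward-learning core, not SPLASH's exact objective).
import torch
import torch.nn.functional as F

def preference_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                    prefer_a: bool) -> torch.Tensor:
    """r_a, r_b: per-step predicted rewards for two trajectory segments."""
    logits = torch.stack([r_a.sum(), r_b.sum()]).unsqueeze(0)  # (1, 2)
    target = torch.tensor([0 if prefer_a else 1])
    return F.cross_entropy(logits, target)

# Example: segment A is preferred, so its predicted return is pushed up.
r_a = torch.randn(50, requires_grad=True)
r_b = torch.randn(50, requires_grad=True)
loss = preference_loss(r_a, r_b, prefer_a=True)
loss.backward()  # would update the reward network in a real pipeline
```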
https://arxiv.org/abs/2507.08707
Multi-view camera-based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion-guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state-of-the-art performance in the camera-only 3D object detection task.
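To make the temporal-alignment idea concrete, here is a minimal sketch of motion-guided BEV warping in PyTorch: a small head predicts a flow field from the concatenated BEV pair, and historical features are resampled onto the current frame with `grid_sample`. The module name, shapes, and the single-conv flow head are illustrative assumptions, not the MBFNet architecture.

```python
# Minimal sketch of motion-guided temporal BEV alignment (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionGuidedBEVAlign(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict a 2-channel flow field from the concatenated BEV pair.
        self.flow_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, bev_hist: torch.Tensor, bev_curr: torch.Tensor) -> torch.Tensor:
        b, _, h, w = bev_curr.shape
        flow = self.flow_head(torch.cat([bev_hist, bev_curr], dim=1))  # (B, 2, H, W)
        # Build a normalized sampling grid in [-1, 1] and offset it by the flow.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).to(flow).expand(b, h, w, 2)
        grid = base + flow.permute(0, 2, 3, 1)
        # Warp historical BEV features onto the current frame.
        return F.grid_sample(bev_hist, grid, align_corners=True)

aligner = MotionGuidedBEVAlign(channels=64)
prev_bev, curr_bev = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128)
aligned = aligner(prev_bev, curr_bev)  # fuse `aligned` with `curr_bev` recurrently
```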
https://arxiv.org/abs/2507.08644
3D reconstruction, which aims to recover the dense three-dimensional structure of a scene, is a cornerstone technology for numerous applications, including augmented/virtual reality, autonomous driving, and robotics. While traditional pipelines like Structure from Motion (SfM) and Multi-View Stereo (MVS) achieve high precision through iterative optimization, they are limited by complex workflows, high computational cost, and poor robustness in challenging scenarios like texture-less regions. Recently, deep learning has catalyzed a paradigm shift in 3D reconstruction. A new family of models, exemplified by DUSt3R, has pioneered a feed-forward approach. These models employ a unified deep network to jointly infer camera poses and dense geometry directly from an unconstrained set of images in a single forward pass. This survey provides a systematic review of this emerging domain. We begin by dissecting the technical framework of these feed-forward models, including their Transformer-based correspondence modeling, joint pose and geometry regression mechanisms, and strategies for scaling from two-view to multi-view scenarios. To highlight the disruptive nature of this new paradigm, we contrast it with both traditional pipelines and earlier learning-based methods like MVSNet. Furthermore, we provide an overview of relevant datasets and evaluation metrics. Finally, we discuss the technology's broad application prospects and identify key future challenges and opportunities, such as model accuracy and scalability, and handling dynamic scenes.
https://arxiv.org/abs/2507.08448
Reliable satellite attitude control is essential for the success of space missions, particularly as satellites increasingly operate autonomously in dynamic and uncertain environments. Reaction wheels (RWs) play a pivotal role in attitude control, and maintaining control resilience during RW faults is critical to preserving mission objectives and system stability. However, traditional Proportional-Derivative (PD) controllers and existing deep reinforcement learning (DRL) algorithms such as TD3, PPO, and A2C often fall short in providing the real-time adaptability and fault tolerance required for autonomous satellite operations. This study introduces a DRL-based control strategy designed to improve satellite resilience and adaptability under fault conditions. Specifically, the proposed method integrates Twin Delayed Deep Deterministic Policy Gradient (TD3) with Hindsight Experience Replay (HER) and Dimension-Wise Clipping (DWC), referred to as TD3-HD, to enhance learning in sparse-reward environments and maintain satellite stability during RW failures. The proposed approach is benchmarked against PD control and leading DRL algorithms. Experimental results show that TD3-HD achieves significantly lower attitude error, improved angular velocity regulation, and enhanced stability under fault conditions. These findings underscore the proposed method's potential as a powerful, fault-tolerant onboard AI solution for autonomous satellite attitude control.
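The two TD3 add-ons named in the abstract lend themselves to short sketches: hindsight relabeling turns sparse attitude-control rewards into reachable ones, and dimension-wise clipping bounds each actuator command separately. The transition layout, tolerance, and reward form below are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketches of HER relabeling and per-dimension clipping.
import numpy as np

def dimension_wise_clip(action: np.ndarray, limits: np.ndarray) -> np.ndarray:
    """Clip each actuator command to its own bound instead of one global bound."""
    return np.clip(action, -limits, limits)

def her_relabel(episode, k=4, tol=0.05, rng=np.random.default_rng()):
    """'Future'-style HER: replay transitions as if a later achieved attitude
    had been the goal, so the sparse reward becomes reachable.

    Each transition is assumed to be (obs, act, reward, next_obs, goal),
    where observations are dicts carrying an 'achieved_attitude' array.
    """
    relabeled = []
    for t, (obs, act, _, next_obs, goal) in enumerate(episode):
        future_ids = rng.integers(t, len(episode), size=k)
        for i in future_ids:
            new_goal = episode[i][3]["achieved_attitude"]
            reward = 0.0 if np.linalg.norm(
                next_obs["achieved_attitude"] - new_goal) < tol else -1.0
            relabeled.append((obs, act, reward, next_obs, new_goal))
    return relabeled
```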
https://arxiv.org/abs/2507.08366
With the advancement of vision-based autonomous driving technology, pedestrian detection has become an important component for improving traffic safety and driving system robustness. Nevertheless, in complex traffic scenarios, conventional pose estimation approaches frequently fail to accurately reconstruct occluded keypoints, primarily due to obstructions caused by vehicles, vegetation, or architectural elements. To address this issue, we propose a novel real-time occluded pedestrian pose completion framework termed Separation and Dimensionality Reduction-based Generative Adversarial Imputation Nets (SDR-GAIN). Unlike previous approaches that train visual models to distinguish occlusion patterns, SDR-GAIN aims to learn human pose directly from the numerical distribution of keypoint coordinates and interpolate missing positions. It employs a self-supervised adversarial learning paradigm to train lightweight generators with residual structures for the imputation of missing pose keypoints. Additionally, it integrates multiple pose standardization techniques to ease the learning process. Experiments conducted on the COCO and JAAD datasets demonstrate that SDR-GAIN surpasses conventional machine learning and Transformer-based missing-data interpolation algorithms in accurately recovering occluded pedestrian keypoints, while simultaneously achieving microsecond-level real-time inference.
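A GAIN-style imputer of this kind can be sketched compactly: the generator below consumes masked keypoint coordinates plus the mask, predicts a residual, and only overwrites the occluded joints. This is the generator half only (the adversarial discriminator and the standardization steps are omitted), with sizes chosen for illustration rather than taken from the paper.

```python
# Sketch of a lightweight residual generator for keypoint imputation.
import torch
import torch.nn as nn

class KeypointImputer(nn.Module):
    def __init__(self, n_keypoints: int = 17, hidden: int = 64):
        super().__init__()
        d = 2 * n_keypoints  # flattened (x, y) per keypoint
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, d))

    def forward(self, kp: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # kp: normalized coords with zeros at occluded joints; mask: 1 = observed.
        x = torch.cat([kp * mask, mask], dim=-1)
        imputed = kp + self.net(x)                 # residual connection
        return kp * mask + imputed * (1 - mask)    # keep observed, fill missing

kp = torch.rand(8, 34)
mask = (torch.rand(8, 34) > 0.3).float()
completed = KeypointImputer()(kp * mask, mask)
```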
https://arxiv.org/abs/2306.03538
The emergence of autonomous Large Language Model (LLM) agents capable of tool usage has introduced new safety risks that go beyond traditional conversational misuse. These agents, empowered to execute external functions, are vulnerable to both user-initiated threats (e.g., adversarial prompts) and tool-initiated threats (e.g., malicious outputs from compromised tools). In this paper, we propose the first unified safety-alignment framework for tool-using agents, enabling models to handle both channels of threat via structured reasoning and sandboxed reinforcement learning. We introduce a tri-modal taxonomy, covering benign, malicious, and sensitive categories for both user prompts and tool responses, and define a policy-driven decision model. Our framework employs a custom-designed sandbox environment that simulates real-world tool execution and allows fine-grained reward shaping. Through extensive evaluations on public and self-built benchmarks, including Agent SafetyBench, InjecAgent, and BFCL, we demonstrate that our safety-aligned agents significantly improve resistance to security threats while preserving strong utility on benign tasks. Our results show that safety and effectiveness can be jointly optimized, laying the groundwork for trustworthy deployment of autonomous LLM agents.
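The decision model can be pictured as a small policy over the two classified channels. The stub below is one plausible reading of the tri-modal taxonomy; the classifier and the action names are invented for illustration and are not the paper's interface.

```python
# Illustrative decision rule over the tri-modal taxonomy (assumed actions).
from enum import Enum

class Label(Enum):
    BENIGN, MALICIOUS, SENSITIVE = range(3)

def classify(text: str) -> Label:      # stand-in for the aligned model's judgment
    raise NotImplementedError

def policy(prompt_label: Label, tool_label: Label) -> str:
    if Label.MALICIOUS in (prompt_label, tool_label):
        return "refuse"                # block user attacks and poisoned tool output
    if Label.SENSITIVE in (prompt_label, tool_label):
        return "sanitize_and_confirm"  # proceed only with guarded reasoning
    return "execute_tool"
```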
https://arxiv.org/abs/2507.08270
The emergence of large language models (LLMs) and agentic systems is enabling autonomous 6G networks with advanced intelligence, including self-configuration, self-optimization, and self-healing. However, the current implementation of individual intelligence tasks necessitates isolated knowledge retrieval pipelines, resulting in redundant data flows and inconsistent interpretations. Inspired by the service model unification effort in Open-RAN (to support interoperability and vendor diversity), we propose KP-A: a unified Network Knowledge Plane specifically designed for Agentic network intelligence. By decoupling network knowledge acquisition and management from intelligence logic, KP-A streamlines development and reduces maintenance complexity for intelligence engineers. By offering an intuitive and consistent knowledge interface, KP-A also enhances interoperability for the network intelligence agents. We demonstrate KP-A in two representative intelligence tasks: live network knowledge Q&A and edge AI service orchestration. All implementation artifacts have been open-sourced to support reproducibility and future standardization efforts.
https://arxiv.org/abs/2507.08164
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
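The interaction pattern is easy to sketch: the model emits Python, the runtime executes it, and the captured output is fed back as the next observation. The loop below is a hedged approximation with a hypothetical `query_mllm` stub, an assumed `<tool>` code-tagging convention, and a deliberately naive executor (a real system would sandbox `exec`).

```python
# Hedged sketch of a generate-execute-refine tool loop; protocol is assumed.
import io
import contextlib

def query_mllm(messages):  # placeholder for a real multimodal LLM call
    raise NotImplementedError

def run_tool_code(code: str) -> str:
    """Execute model-written code and capture stdout as the observation."""
    buf = io.StringIO()
    namespace = {}
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # a real system would isolate this
    return buf.getvalue()

def agent_loop(task: str, image, max_turns: int = 5):
    messages = [{"role": "user", "content": [task, image]}]
    reply = ""
    for _ in range(max_turns):
        reply = query_mllm(messages)
        if "<tool>" not in reply:
            break  # final answer, no further tool call
        code = reply.split("<tool>")[1].split("</tool>")[0]
        observation = run_tool_code(code)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": "Execution output:\n" + observation})
    return reply
```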
https://arxiv.org/abs/2507.07998
Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM) - which includes Visual Odometry - relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs) and used the Brevitas library and the FINN framework for model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
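For readers unfamiliar with Brevitas, the snippet below shows how a SuperPoint-style encoder stage can be expressed with quantized layers prior to hardware-aware export. The 4-bit widths and layer sizes are illustrative, not the paper's exact configuration.

```python
# Sketch of a quantized SuperPoint-like encoder block using Brevitas layers.
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantReLU

def quant_block(in_ch: int, out_ch: int, bit_width: int = 4) -> nn.Sequential:
    return nn.Sequential(
        QuantConv2d(in_ch, out_ch, kernel_size=3, padding=1,
                    weight_bit_width=bit_width),
        nn.BatchNorm2d(out_ch),
        QuantReLU(bit_width=bit_width),
    )

# First two encoder stages of a SuperPoint-like backbone (grayscale input).
encoder = nn.Sequential(
    quant_block(1, 64), quant_block(64, 64), nn.MaxPool2d(2),
    quant_block(64, 64), quant_block(64, 64), nn.MaxPool2d(2),
)
```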
https://arxiv.org/abs/2507.07903
Autonomous agents, particularly in the field of robotics, rely on sensory information to perceive and navigate their environment. However, these sensory inputs are often imperfect, leading to distortions in the agent's internal representation of the world. This paper investigates the nature of these perceptual distortions and how they influence autonomous representation learning using a minimal robotic system. We utilize a simulated two-wheeled robot equipped with distance sensors and a compass, operating within a simple square environment. Through analysis of the robot's sensor data during random exploration, we demonstrate how a distorted perceptual space emerges. Despite these distortions, we identify emergent structures within the perceptual space that correlate with the physical environment, revealing how the robot autonomously learns a structured representation for navigation without explicit spatial information. This work contributes to the understanding of embodied cognition, minimal agency, and the role of perception in self-generated navigation strategies in artificial life.
https://arxiv.org/abs/2507.07845
Current AI advances largely rely on scaling neural models and expanding training datasets to achieve generalization and robustness. Despite notable successes, this paradigm incurs significant environmental, economic, and ethical costs, limiting sustainability and equitable access. Inspired by biological sensory systems, where adaptation occurs dynamically at the input (e.g., adjusting pupil size, refocusing vision), we advocate for adaptive sensing as a necessary and foundational shift. Adaptive sensing proactively modulates sensor parameters (e.g., exposure, sensitivity, multimodal configurations) at the input level, significantly mitigating covariate shifts and improving efficiency. Empirical evidence from recent studies demonstrates that adaptive sensing enables small models (e.g., EfficientNet-B0) to surpass substantially larger models (e.g., OpenCLIP-H) trained with significantly more data and compute. We (i) outline a roadmap for broadly integrating adaptive sensing into real-world applications spanning humanoid robotics, healthcare, autonomous systems, agriculture, and environmental monitoring, (ii) critically assess technical and ethical integration challenges, and (iii) propose targeted research directions, such as standardized benchmarks, real-time adaptive algorithms, multimodal integration, and privacy-preserving methods. Collectively, these efforts aim to transition the AI community toward sustainable, robust, and equitable artificial intelligence systems.
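A toy example makes the input-level adaptation point concrete: rather than enlarging the model, a proportional controller nudges exposure so mean luminance tracks a target. The target, gain, and sensor limits below are invented numbers, and the camera interface is a stand-in.

```python
# Toy input-level adaptation: proportional exposure control on mean intensity.
import numpy as np

def adapt_exposure(frame: np.ndarray, exposure_ms: float,
                   target_mean: float = 118.0, gain: float = 0.5) -> float:
    """Nudge exposure so mean luminance of 8-bit frames tracks target_mean."""
    error = (target_mean - float(frame.mean())) / 255.0
    new_exposure = exposure_ms * (1.0 + gain * error)
    return float(np.clip(new_exposure, 0.03, 33.0))  # assumed sensor limits

frame = (np.random.rand(480, 640) * 255).astype(np.uint8)
next_exposure = adapt_exposure(frame, exposure_ms=8.0)
```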
https://arxiv.org/abs/2507.07820
Recent studies show that large language models (LLMs) and vision-language models (VLMs) trained on web-scale data can empower end-to-end autonomous driving systems with better generalization and interpretation. Specifically, by dynamically routing inputs to specialized subsets of parameters, the Mixture-of-Experts (MoE) technique enables general LLMs or VLMs to achieve substantial performance improvements while maintaining computational efficiency. However, general MoE models usually demand extensive training data and complex optimization. In this work, inspired by the learning process of human drivers, we propose a skill-oriented MoE, called MoSE, which mimics human drivers' learning and reasoning processes, skill by skill and step by step. We propose a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary driving competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To further align the driving process with the multi-step planning of human reasoning and end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step by step. Unlike multi-round dialogs, MoSE integrates valuable auxiliary tasks (e.g., description, reasoning, planning) in a single forward pass without introducing any extra computational cost. With fewer than 3B sparsely activated parameters, our model outperforms several models with 8B+ parameters on the CODA AD corner-case reasoning task. Compared to existing methods based on open-source models and data, our approach achieves state-of-the-art performance with a significantly smaller activated model size (reduced by at least 62.5%) in a single-turn conversation.
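The skill-oriented routing step can be sketched as a gate supervised by skill annotations during router pretraining. The skill set, dimensions, and auxiliary loss below are illustrative assumptions rather than MoSE's implementation.

```python
# Sketch of a skill-supervised MoE router; skill names are invented.
import torch
import torch.nn as nn
import torch.nn.functional as F

SKILLS = ["perception", "prediction", "planning", "description"]

class SkillRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int = len(SKILLS)):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, h: torch.Tensor, skill_labels=None):
        logits = self.gate(h)                           # (B, n_experts)
        # During router pretraining, annotations supervise expert choice.
        aux_loss = (F.cross_entropy(logits, skill_labels)
                    if skill_labels is not None else None)
        top1 = logits.argmax(dim=-1)                    # sparse: one expert each
        return top1, aux_loss

router = SkillRouter(d_model=256)
h = torch.randn(8, 256)
labels = torch.randint(0, len(SKILLS), (8,))
expert_ids, aux = router(h, labels)
```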
https://arxiv.org/abs/2507.07818
Image sensors are integral to a wide range of safety- and security-critical systems, including surveillance infrastructure, autonomous vehicles, and industrial automation. These systems rely on the integrity of visual data to make decisions. In this work, we investigate a novel class of electromagnetic signal injection attacks that target the analog domain of image sensors, allowing adversaries to manipulate raw visual inputs without triggering conventional digital integrity checks. We uncover a previously undocumented attack phenomenon on CMOS image sensors: rainbow-like color artifacts induced in captured images through carefully tuned electromagnetic interference. We further evaluate the impact of these attacks on state-of-the-art object detection models, showing that the injected artifacts propagate through the image signal processing pipeline and lead to significant mispredictions. Our findings highlight a critical and underexplored vulnerability in the visual perception stack, underscoring the need for more robust defenses against physical-layer attacks in such systems.
https://arxiv.org/abs/2507.07773
Robust Visual SLAM (vSLAM) is essential for autonomous systems operating in real-world environments, where challenges such as dynamic objects, low texture, and, critically, varying illumination conditions often degrade performance. Existing feature-based SLAM systems rely on fixed front-end parameters, making them vulnerable to sudden lighting changes and unstable feature tracking. To address these challenges, we propose "IRAF-SLAM", an Illumination-Robust and Adaptive Feature-Culling front-end designed to enhance vSLAM resilience in complex and challenging environments. Our approach introduces: (1) an image enhancement scheme to preprocess and adjust image quality under varying lighting conditions; (2) an adaptive feature extraction mechanism that dynamically adjusts detection sensitivity based on image entropy, pixel intensity, and gradient analysis; and (3) a feature culling strategy that filters out unreliable feature points using density distribution analysis and a lighting impact factor. Comprehensive evaluations on the TUM-VI and European Robotics Challenge (EuRoC) datasets demonstrate that IRAF-SLAM significantly reduces tracking failures and achieves superior trajectory accuracy compared to state-of-the-art vSLAM methods under adverse illumination conditions. These results highlight the effectiveness of adaptive front-end strategies in improving vSLAM robustness without incurring significant computational overhead. The implementation of IRAF-SLAM is publicly available at https://thanhnguyencanh.github.io/iraf-slam/.
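Component (2) is the easiest to illustrate: estimate per-frame entropy and relax the detector threshold in low-information frames so tracking keeps enough features. The mapping below (OpenCV FAST, with invented thresholds) is a simplified stand-in for the paper's entropy/intensity/gradient analysis.

```python
# Simplified entropy-adaptive feature extraction (illustrative thresholds).
import cv2
import numpy as np

def image_entropy(gray: np.ndarray) -> float:
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def adaptive_fast_keypoints(gray: np.ndarray):
    ent = image_entropy(gray)                 # roughly 0 (flat) .. 8 (rich) bits
    # Lower the FAST threshold in low-entropy (e.g., dim or washed-out) frames.
    threshold = int(np.interp(ent, [3.0, 7.5], [5, 25]))
    detector = cv2.FastFeatureDetector_create(threshold=threshold)
    return detector.detect(gray, None)

gray = (np.random.rand(480, 640) * 255).astype(np.uint8)
keypoints = adaptive_fast_keypoints(gray)
```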
https://arxiv.org/abs/2507.07752
With the widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLMs' capabilities, specifically so-called hallucinations: occurrences in which LLMs provide spurious, factually incorrect, or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs, that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.
https://arxiv.org/abs/2507.07505
Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking their association with global SD maps for hybrid navigation, which makes it challenging to utilize online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid-navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Built on existing datasets, OMA contains 480k roads and 260k lane paths and provides the corresponding metrics to evaluate model performance. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at this https URL.
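One plausible shape for the path-aware attention step: lane-path tokens attend over candidate SD-road embeddings, and a pooled lane embedding is scored against each road. This is a guess at the mechanism's skeleton, not the Map Association Transformer itself; all dimensions are assumptions.

```python
# Speculative sketch of lane-path-to-road association via cross-attention.
import torch
import torch.nn as nn

class PathAwareAssociation(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, lane_tokens: torch.Tensor,
                road_tokens: torch.Tensor) -> torch.Tensor:
        # lane_tokens: (B, L, D) sampled points along one online lane path
        # road_tokens: (B, R, D) one embedding per candidate SD road
        fused, _ = self.attn(lane_tokens, road_tokens, road_tokens)
        lane_emb = fused.mean(dim=1, keepdim=True)                  # (B, 1, D)
        return (lane_emb @ road_tokens.transpose(1, 2)).squeeze(1)  # (B, R)

logits = PathAwareAssociation()(torch.randn(2, 20, 128), torch.randn(2, 50, 128))
```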
https://arxiv.org/abs/2507.07487
Autonomous flight in GPS-denied indoor spaces requires trajectories that keep visual localization error tightly bounded across varied missions. Whereas visual-inertial odometry (VIO) accumulates drift over time, scene coordinate regression (SCR) yields drift-free, high-accuracy absolute pose estimation. We present a perception-aware framework that couples an evidential-learning-based SCR pose estimator with a receding-horizon trajectory optimizer. The optimizer steers the onboard camera toward pixels whose uncertainty predicts reliable scene coordinates, while a fixed-lag smoother fuses the low-rate SCR stream with high-rate IMU data to close the perception-control loop in real time. In simulation, our planner reduces mean translation error by 54% / 15% and mean rotation error by 40% / 31% relative to yaw-fixed and forward-looking baselines, respectively. Moreover, a hardware-in-the-loop experiment validates the feasibility of the proposed framework.
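An illustrative cost term shows how the planner can prefer low-uncertainty viewpoints: candidate yaw angles are scored by the predicted scene-coordinate uncertainty of the pixels the camera would see. The panoramic uncertainty map below is a stand-in for the evidential SCR head's output, and all geometry is simplified.

```python
# Toy perception-aware viewpoint cost over a panoramic uncertainty map.
import numpy as np

def viewpoint_cost(uncertainty_map: np.ndarray, yaw: float,
                   fov_rad: float = np.deg2rad(90)) -> float:
    """Mean predicted uncertainty over the horizontal slice seen at `yaw`."""
    h, w = uncertainty_map.shape                 # columns <-> bearing angle
    center = int((yaw % (2 * np.pi)) / (2 * np.pi) * w)
    half = int(fov_rad / (2 * np.pi) * w / 2)
    cols = np.arange(center - half, center + half) % w
    return float(uncertainty_map[:, cols].mean())

unc = np.random.rand(64, 512)  # placeholder evidential uncertainty panorama
best_yaw = min(np.linspace(0, 2 * np.pi, 36, endpoint=False),
               key=lambda y: viewpoint_cost(unc, y))
```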
https://arxiv.org/abs/2507.07467
Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
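Stages (2) and (3) reduce to a few lines each: average the anomaly score inside each SAM instance mask so every object carries one consistent score, then sharpen and smooth contours. SAM mask generation is elided, and the weighting and kernel sizes below are illustrative assumptions.

```python
# Sketch of object-level score calibration and boundary refinement.
import cv2
import numpy as np

def objectness_aware_calibration(score: np.ndarray, masks) -> np.ndarray:
    calibrated = score.copy()
    for m in masks:                      # one boolean mask per SAM instance
        calibrated[m] = score[m].mean()  # one consistent score per object
    return calibrated

def refine_boundaries(score: np.ndarray) -> np.ndarray:
    edges = cv2.Laplacian(score.astype(np.float32), cv2.CV_32F, ksize=3)
    sharpened = score + 0.5 * edges      # emphasize object contours
    return cv2.GaussianBlur(sharpened, (5, 5), 0)

score_map = np.random.rand(256, 512).astype(np.float32)
masks = [np.zeros((256, 512), dtype=bool)]
masks[0][64:128, 100:220] = True
refined = refine_boundaries(objectness_aware_calibration(score_map, masks))
```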
https://arxiv.org/abs/2507.07460
Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked with performing essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLM agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning, and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research toward robust, open-ended agents in complex production-living environments.
https://arxiv.org/abs/2507.07445