Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at this https URL.
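The two pretraining objectives are concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the idea — reconstruct features of masked anatomical regions and predict the latent visual outcome of a probe motion. The module names, shapes, and 6-DoF motion assumption are ours for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MotionAwareWorldModel(nn.Module):
    def __init__(self, dim=256, motion_dim=6):  # 6-DoF probe motion assumed
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, dim))
        self.region_head = nn.Linear(dim, dim)       # masked-region feature prediction
        self.motion_proj = nn.Linear(motion_dim, dim)
        self.dynamics = nn.Sequential(               # latent transition under probe motion
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, img_t, motion, img_next, masked_target):
        z_t = self.encoder(img_t)
        # Objective 1: predict features of masked anatomical regions.
        loss_anatomy = nn.functional.mse_loss(self.region_head(z_t), masked_target)
        # Objective 2: simulate the visual outcome of the probe adjustment.
        z_pred = self.dynamics(torch.cat([z_t, self.motion_proj(motion)], dim=-1))
        with torch.no_grad():                        # stop-gradient on the target latent
            z_next = self.encoder(img_next)
        loss_motion = nn.functional.mse_loss(z_pred, z_next)
        return loss_anatomy + loss_motion
```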
https://arxiv.org/abs/2504.13065
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.
https://arxiv.org/abs/2504.13059
Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computation to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?
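Because the mechanism is so simple, a sketch makes the contrast with reflection-based refinement explicit. `llm_generate` and `is_correct` below are hypothetical stand-ins for a model call and an external verifier (e.g., unit tests or an exact-match checker); nothing about a failed attempt is fed back into the next one.

```python
def solve_with_retrials(question: str, max_retrials: int = 5) -> str:
    answer = ""
    for _ in range(max_retrials):
        answer = llm_generate(question)   # fresh attempt; no feedback injected
        if is_correct(question, answer):  # external check only signals pass/fail
            return answer
    return answer  # fall back to the last attempt
```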
https://arxiv.org/abs/2504.12951
Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to poor generalizability. Moreover, due to the diversity of dataset biases and the insufficiency of in-context-learning-based bias suppression, the effectiveness of previous prior-knowledge-based debiasing methods and in-context-learning-based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of the instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLMs and improve their generalizability across different tasks.
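As a worked illustration of the information-theoretic view (our reading, not the paper's procedure): if a hypothetical bias attribute b carries information gain about labels y, one can resample the instruction-tuning set until I(Y; B) is driven toward zero.

```python
import numpy as np
from collections import Counter

def information_gain(y, b):
    """Empirical mutual information I(Y;B) in nats."""
    n = len(y)
    p_y, p_b, p_yb = Counter(y), Counter(b), Counter(zip(y, b))
    return sum(c / n * np.log((c / n) / ((p_y[yi] / n) * (p_b[bi] / n)))
               for (yi, bi), c in p_yb.items())

def balance(dataset, y, b, seed=0):
    # Downsample every (label, bias-attribute) cell to the smallest cell size,
    # so the bias attribute no longer predicts the label.
    rng = np.random.default_rng(seed)
    cells = {}
    for i, key in enumerate(zip(y, b)):
        cells.setdefault(key, []).append(i)
    m = min(len(v) for v in cells.values())
    keep = [i for v in cells.values() for i in rng.choice(v, m, replace=False)]
    return [dataset[i] for i in keep]
```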
https://arxiv.org/abs/2504.12898
End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at this https URL.
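A minimal sketch of what such a selection strategy could look like (the weighting scheme and array interfaces are our assumptions, not UncAD's exact formulation): score each candidate trajectory by collision risk, inflating the penalty wherever the online map is uncertain.

```python
import numpy as np

def select_trajectory(trajectories, collision_prob, map_uncertainty, beta=0.5):
    """trajectories: list of (T, 2) arrays; collision_prob, map_uncertainty: (N, T)."""
    risk = collision_prob * (1.0 + beta * map_uncertainty)  # inflate risk where map is unsure
    scores = risk.sum(axis=1)                               # lower total risk = safer
    return trajectories[int(np.argmin(scores))]
```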
https://arxiv.org/abs/2504.12826
This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs' effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.
https://arxiv.org/abs/2504.12817
Autonomous driving is a complex undertaking. A common approach is to break down the driving task into individual subtasks through modularization. These sub-modules are usually developed and published separately. However, if these individually developed algorithms have to be combined again to form a full-stack autonomous driving software, this poses particular challenges. Drawing upon our practical experience in developing the software of TUM Autonomous Motorsport, we have identified and derived these challenges in developing an autonomous driving software stack within a scientific environment. We do not focus on the specific challenges of individual algorithms but on the general difficulties that arise when deploying research algorithms on real-world test vehicles. To overcome these challenges, we introduce strategies that have been effective in our development approach. We additionally provide open-source implementations that enable these concepts on GitHub. As a result, this paper's contributions will simplify future full-stack autonomous driving projects, which are essential for a thorough evaluation of the individual algorithms.
https://arxiv.org/abs/2504.12813
The significant achievements of pre-trained models leveraging large volumes of data in the fields of NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt adapter-based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.
https://arxiv.org/abs/2504.12709
Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: this https URL.
https://arxiv.org/abs/2504.12696
End-to-end autonomous driving has made impressive progress in recent years. Former end-to-end autonomous driving approaches often decouple planning and motion tasks, treating them as separate modules. This separation overlooks the potential benefits that planning can gain from learning out-of-distribution data encountered in motion tasks. However, unifying these tasks poses significant challenges, such as constructing shared contextual representations and handling the unobservability of other vehicles' states. To address these challenges, we propose TTOG, a novel two-stage trajectory generation framework. In the first stage, a diverse set of trajectory candidates is generated, while the second stage focuses on refining these candidates through vehicle state information. To mitigate the issue of unavailable surrounding vehicle states, TTOG employs a self-vehicle data-trained state estimator, subsequently extended to other vehicles. Furthermore, we introduce ECSA (equivariant context-sharing scene adapter) to enhance the generalization of scene representations across different agents. Experimental results demonstrate that TTOG achieves state-of-the-art performance across both planning and motion tasks. Notably, on the challenging open-loop nuScenes dataset, TTOG reduces the L2 distance by 36.06%. Furthermore, on the closed-loop Bench2Drive dataset, our approach achieves a 22% improvement in the driving score (DS), significantly outperforming existing baselines.
https://arxiv.org/abs/2504.12667
This paper presents a novel autonomous drone-based smoke plume tracking system capable of navigating and tracking plumes in highly unsteady atmospheric conditions. The system integrates advanced hardware, software, and a comprehensive simulation environment to ensure robust performance in both controlled and real-world settings. The quadrotor, equipped with a high-resolution imaging system and an advanced onboard computing unit, performs precise maneuvers while accurately detecting and tracking dynamic smoke plumes under fluctuating conditions. Our software implements a two-phase flight operation, i.e., descending into the smoke plume upon detection and continuously monitoring the smoke movement during in-plume tracking. Leveraging Proportional-Integral-Derivative (PID) control and a Proximal Policy Optimization-based Deep Reinforcement Learning (DRL) controller enables adaptation to plume dynamics. Unreal Engine simulation evaluates performance under various smoke-wind scenarios, from steady flow to complex, unsteady fluctuations, showing that while the PID controller performs adequately in simpler scenarios, the DRL-based controller excels in more challenging environments. Field tests corroborate these findings. This system opens new possibilities for drone-based monitoring in areas like wildfire management and air quality assessment. The successful integration of DRL for real-time decision-making advances autonomous drone control for dynamic environments.
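The PID half of the controller is standard and easy to sketch; the gains, loop rate, and image-error convention below are illustrative assumptions, not the paper's tuning.

```python
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, None

    def step(self, error):
        self.integral += error * self.dt
        # Skip the derivative term on the first step to avoid a derivative kick.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# e.g., keep the plume centroid centered: pixel offset -> lateral velocity command
pid_y = PID(kp=0.004, ki=0.0001, kd=0.002, dt=0.05)
centroid_x, image_width = 410.0, 640          # example detection output
v_lateral = pid_y.step(error=centroid_x - image_width / 2)
```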
https://arxiv.org/abs/2504.12664
Deep learning (DL)-based image classification models are essential for autonomous vehicle (AV) perception modules since incorrect categorization might have severe repercussions. Adversarial attacks are widely studied cyberattacks that can lead DL models to predict inaccurate output, such as incorrectly classified traffic signs by the perception module of an autonomous vehicle. In this study, we create and compare hybrid classical-quantum deep learning (HCQ-DL) models with classical deep learning (C-DL) models to demonstrate robustness against adversarial attacks for perception modules. Before feeding images into the quantum system, we used the transfer-learning models AlexNet and VGG-16 as feature extractors. We tested over 1000 quantum circuits in our HCQ-DL models against projected gradient descent (PGD), fast gradient sign attack (FGSA), and gradient attack (GA), three well-known untargeted adversarial approaches. We evaluated the performance of all models under adversarial-attack and no-attack scenarios. Our HCQ-DL models maintain accuracy above 95% in the no-attack scenario and above 91% under GA and FGSA attacks, which is higher than C-DL models. During the PGD attack, our AlexNet-based HCQ-DL model maintained an accuracy of 85%, compared to C-DL models that achieved accuracies below 21%. Our results highlight that HCQ-DL models provide improved accuracy for traffic sign classification under adversarial settings compared to their classical counterparts.
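A hedged sketch of the hybrid pipeline in the paper's spirit — a frozen classical feature extractor feeding a small variational quantum circuit — using PennyLane's PyTorch interface. The qubit count, circuit template, and output classes (GTSRB-style traffic signs) are illustrative assumptions, not the authors' searched circuit designs.

```python
import torch.nn as nn
import torchvision.models as models
import pennylane as qml

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))          # encode features
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))   # trainable circuit
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(circuit, weight_shapes={"weights": (3, n_qubits)})

backbone = models.alexnet(weights="DEFAULT").features          # frozen feature extractor
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(
    backbone, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(256, n_qubits),   # project AlexNet features to qubit inputs
    qlayer,
    nn.Linear(n_qubits, 43),    # e.g., 43 traffic-sign classes (GTSRB assumed)
)
```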
https://arxiv.org/abs/2504.12644
Safe and efficient path planning in parking scenarios presents a significant challenge due to the presence of cluttered environments filled with static and dynamic obstacles. To address this, we propose a novel and computationally efficient planning strategy that seamlessly integrates the predictions of dynamic obstacles into the planning process, ensuring the generation of collision-free paths. Our approach builds upon the conventional Hybrid A* algorithm by introducing a time-indexed variant that explicitly accounts for the predictions of dynamic obstacles during node exploration in the graph, thus enabling dynamic obstacle avoidance. We integrate the time-indexed Hybrid A* algorithm within an online planning framework to compute local paths at each planning step, guided by an adaptively chosen intermediate goal. The proposed method is validated in diverse parking scenarios, including perpendicular, angled, and parallel parking. Through simulations, we showcase our approach's potential to greatly improve efficiency and safety compared to the state-of-the-art spline-based planning method for parking situations.
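The core modification is small enough to sketch: the search state gains a time index, and feasibility checks query obstacle predictions at that time. `footprint_collides`, `predicted_obstacles`, and `apply_primitive` are hypothetical helpers, not the paper's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    x: float
    y: float
    heading: float
    t: int  # discrete time index added to the classic (x, y, heading) state

def is_feasible(node: Node, static_map, predictions, dt=0.1) -> bool:
    if footprint_collides(node, static_map):           # static obstacles as usual
        return False
    for obs in predicted_obstacles(predictions, node.t * dt):
        if footprint_collides(node, obs):              # dynamic obstacle at time t
            return False
    return True

def expand(node: Node, motion_primitives):
    # Successors advance both pose and time, so avoidance is planned in space-time.
    return [Node(*apply_primitive(node, m), t=node.t + 1) for m in motion_primitives]
```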
https://arxiv.org/abs/2504.12616
Provenance is the chronology of things, resonating with the fundamental pursuit to uncover origins, trace connections, and situate entities within the flow of space and time. As artificial intelligence advances towards autonomous agents capable of interactive collaboration on complex tasks, the provenance of generated content becomes entangled in the interplay of collective creation, where contributions are continuously revised, extended or overwritten. In a multi-agent generative chain, content undergoes successive transformations, often leaving little, if any, trace of prior contributions. In this study, we investigate the problem of tracking multi-agent provenance across the temporal dimension of generation. We propose a chronological system for post hoc attribution of generative history from content alone, without reliance on internal memory states or external meta-information. At its core lies the notion of symbolic chronicles, representing signed and time-stamped records, in a form analogous to the chain of custody in forensic science. The system operates through a feedback loop, whereby each generative timestep updates the chronicle of prior interactions and synchronises it with the synthetic content in the very act of generation. This research seeks to develop an accountable form of collaborative artificial intelligence within evolving cyber ecosystems.
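One way to make the chain-of-custody analogy concrete (a minimal sketch under our own assumptions about the record schema and a shared HMAC key, not the paper's exact protocol): each record hashes the content, carries a timestamp, links to the previous digest, and is signed.

```python
import hashlib, hmac, json, time

def append_record(chronicle: list, agent_id: str, content: str, key: bytes) -> list:
    prev_digest = chronicle[-1]["digest"] if chronicle else ""
    record = {
        "agent": agent_id,
        "timestamp": time.time(),
        "content_hash": hashlib.sha256(content.encode()).hexdigest(),
        "prev": prev_digest,                       # links records chronologically
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["digest"] = hmac.new(key, payload, hashlib.sha256).hexdigest()  # signature
    return chronicle + [record]

def verify(chronicle: list, key: bytes) -> bool:
    prev = ""
    for rec in chronicle:
        body = {k: v for k, v in rec.items() if k != "digest"}
        payload = json.dumps(body, sort_keys=True).encode()
        if rec["prev"] != prev:                    # broken chronology
            return False
        if not hmac.compare_digest(
                rec["digest"], hmac.new(key, payload, hashlib.sha256).hexdigest()):
            return False                           # tampered or forged record
        prev = rec["digest"]
    return True
```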
https://arxiv.org/abs/2504.12612
The emergence of Agentic Artificial Intelligence (AAI) systems capable of independently initiating digital interactions necessitates a new optimisation paradigm designed explicitly for seamless agent-platform interactions. This article introduces Agentic AI Optimisation (AAIO) as an essential methodology for ensuring effective integration between websites and agentic AI systems. Just as Search Engine Optimisation (SEO) has shaped digital content discoverability, AAIO can define interactions between autonomous AI agents and online platforms. By examining the mutual interdependency between website optimisation and agentic AI success, the article highlights the virtuous cycle that AAIO can create. It further explores the governance, ethical, legal, and social implications (GELSI) of AAIO, emphasising the necessity of proactive regulatory frameworks to mitigate potential negative impacts. The article concludes by affirming AAIO's essential role as part of a fundamental digital infrastructure in the era of autonomous digital agents, advocating for equitable and inclusive access to its benefits.
https://arxiv.org/abs/2504.12482
Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls and perceptual aliasing due to apartments with similar layouts inter and intra floors. In this paper, we focus on the global re-positioning of a robot with respect to an accurate scanned mesh of the building solely using LiDAR data. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of the LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution of potential global positions provides multi-modal position estimates. We evaluate our approach across five real-world datasets and show a place recognition accuracy of 77% (±2 m) on average while outperforming baselines by a factor of 2 in mean error.
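A compact sketch of the multi-candidate inference step (our illustrative assumptions: a DDPM-style schedule, poses as (x, y, yaw), and an `eps_model` conditioned on a PointNet++-style cloud embedding): denoising N random poses in parallel yields a multi-modal set of global position hypotheses.

```python
import torch

def sample_candidates(eps_model, cloud_embedding, n_candidates=32, T=50):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(n_candidates, 3)                     # (x, y, yaw) hypotheses
    for t in reversed(range(T)):
        eps = eps_model(x, t, cloud_embedding)           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])  # DDPM posterior mean
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # N candidate global poses; cluster or rank them downstream
```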
https://arxiv.org/abs/2504.12412
Conventional trajectory planning approaches for autonomous vehicles often assume a fixed vehicle model that remains constant regardless of the vehicle's location. This overlooks the critical fact that the tires and the surface are the two force-transmitting partners in vehicle dynamics; while the tires stay with the vehicle, surface conditions vary with location. Recognizing these challenges, this paper presents a novel framework for spatially resolving dynamic constraints in both offline and online planning algorithms applied to autonomous racing. We introduce the GripMap concept, which provides a spatial resolution of vehicle dynamic constraints in the Frenet frame, allowing adaptation to locally varying grip conditions. This enables compensation for location-specific effects, more efficient vehicle behavior, and increased safety, unattainable with spatially invariant vehicle models. The focus is on low storage demand and quick access through perfect hashing. In the presented form, this framework proved advantageous in real-world applications. Experiments inspired by autonomous racing demonstrate its effectiveness. In future work, this framework can serve as a foundational layer for developing future interpretable learning algorithms that adjust to varying grip conditions in real-time.
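A minimal sketch of a GripMap-style structure: grip limits stored per discretized arc length s along the reference line and indexed directly, which is the trivially perfect hash giving constant-time, collision-free access. The resolution and the stored quantity (a friction coefficient) are our assumptions.

```python
import numpy as np

class GripMap:
    def __init__(self, track_length_m: float, resolution_m: float = 1.0, mu_default=1.0):
        self.length, self.res = track_length_m, resolution_m
        self.mu = np.full(int(track_length_m / resolution_m), mu_default)

    def set_segment(self, s_start: float, s_end: float, mu: float):
        self.mu[int(s_start / self.res): int(s_end / self.res)] = mu

    def query(self, s: float) -> float:
        # Direct index = perfect hash: one bucket per s-cell, no collisions.
        return float(self.mu[int((s % self.length) / self.res)])

gmap = GripMap(track_length_m=4000.0)
gmap.set_segment(1200.0, 1300.0, mu=0.7)   # e.g., a damp patch lowers grip
a_lat_max = gmap.query(1250.0) * 9.81      # location-specific lateral limit
```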
https://arxiv.org/abs/2504.12115
Achieving reliable and safe autonomous driving in off-road environments requires accurate and efficient terrain traversability analysis. However, this task faces several challenges, including the scarcity of large-scale datasets tailored for off-road scenarios, the high cost and potential errors of manual annotation, the stringent real-time requirements of motion planning, and the limited computational power of onboard units. To address these challenges, this paper proposes a novel traversability learning method that leverages self-supervised learning, eliminating the need for manual annotation. For the first time, a Birds-Eye View (BEV) representation is used as input, reducing computational burden and improving adaptability to downstream motion planning. During vehicle operation, the proposed method conducts online analysis of traversed regions and dynamically updates prototypes to adaptively assess the traversability of the current environment, effectively handling dynamic scene changes. We evaluate our approach against state-of-the-art benchmarks on both public datasets and our own dataset, covering diverse seasons and geographical locations. Experimental results demonstrate that our method significantly outperforms recent approaches. Additionally, real-world vehicle experiments show that our method operates at 10 Hz, meeting real-time requirements, while a 5.5 km autonomous driving experiment further validates the generated traversability cost maps' compatibility with downstream motion planning.
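The online prototype idea can be sketched compactly (the EMA update rate and cosine scoring are our assumptions): features of BEV cells the vehicle actually traversed update a running prototype, and new cells are scored by similarity to it.

```python
import torch
import torch.nn.functional as F

class PrototypeScorer:
    def __init__(self, feat_dim=64, momentum=0.95):
        self.prototype = torch.zeros(feat_dim)
        self.momentum = momentum

    def update(self, traversed_feats: torch.Tensor):
        # traversed_feats: (N, D) features of cells the vehicle just drove over.
        mean_feat = traversed_feats.mean(dim=0)
        self.prototype = self.momentum * self.prototype + (1 - self.momentum) * mean_feat

    def traversability_cost(self, bev_feats: torch.Tensor) -> torch.Tensor:
        # bev_feats: (H, W, D) -> cost map in [0, 1]; low similarity = high cost.
        sim = F.cosine_similarity(bev_feats, self.prototype.view(1, 1, -1), dim=-1)
        return 1.0 - (sim + 1.0) / 2.0
```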
https://arxiv.org/abs/2504.12109
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.
https://arxiv.org/abs/2504.11914
Autonomous exploration of cluttered environments requires efficient exploration strategies that guarantee safety against potential collisions with unknown random obstacles. This paper presents a novel approach combining a graph neural network-based greedy exploration policy with a safety shield to ensure safe navigation goal selection. The network is trained using reinforcement learning and the proximal policy optimization algorithm to maximize exploration efficiency while reducing safety shield interventions. If the policy selects an infeasible action, the safety shield intervenes to choose the best feasible alternative, ensuring system consistency. Moreover, this paper proposes a reward function that includes a potential field based on the agent's proximity to unexplored regions and the expected information gain from reaching them. Overall, the approach investigated in this paper merges the adaptability of reinforcement learning-driven exploration policies with the guarantees provided by explicit safety mechanisms. Extensive evaluations in simulated environments demonstrate that the approach enables efficient and safe exploration in cluttered environments.
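A minimal sketch of the shield pattern described above (the `feasible` check, goal scoring, and reward form are assumed interfaces, not the paper's implementation): the learned policy proposes a ranked set of goals, and infeasible choices are replaced by the best feasible alternative.

```python
import numpy as np

def shielded_action(policy_logits, candidate_goals, feasible):
    order = np.argsort(-policy_logits)                # policy's preference ranking
    for idx in order:
        if feasible(candidate_goals[idx]):            # e.g., a collision-free path exists
            return candidate_goals[idx], idx != order[0]  # (goal, shield_intervened)
    raise RuntimeError("no feasible goal: trigger a recovery behavior")

# Illustrative reward with a potential-field term toward unexplored space:
# r_t = info_gain(goal) - lambda_d * dist_to_frontier(agent) - c_shield * intervened
```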
https://arxiv.org/abs/2504.11907