Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce \textbf{SafePro}, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks that carry safety risks across diverse professional domains, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.
https://arxiv.org/abs/2601.06663
Localization is a fundamental capability for autonomous robots, enabling them to operate effectively in dynamic environments. In Robocon 2025, accurate and reliable localization is crucial for improving shooting precision, avoiding collisions with other robots, and navigating the competition field efficiently. In this paper, we propose a hybrid localization algorithm that integrates classical techniques with learning-based methods and relies solely on visual data from the court's floor to achieve self-localization on the basketball field.
https://arxiv.org/abs/2601.08713
Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
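For intuition, here is a minimal, hypothetical sketch of the two ideas the abstract describes: an information tree whose nodes are either gold evidence or adversarial distractors, and a constrained walk that forces a teacher agent to inspect distractors before the useful evidence. The node schema, field names, and traversal rule are illustrative assumptions, not RAGShaper's actual design.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InfoNode:
    text: str
    is_distractor: bool = False                      # adversarial vs. gold evidence
    children: List["InfoNode"] = field(default_factory=list)

def constrained_walk(root: InfoNode) -> List[str]:
    """Depth-first walk that visits distractor siblings before gold nodes, so the
    resulting teacher trajectory must demonstrate noise rejection explicitly."""
    visited = [root.text]
    for child in sorted(root.children, key=lambda n: not n.is_distractor):
        visited.extend(constrained_walk(child))
    return visited

tree = InfoNode("question: founding year of the lab", children=[
    InfoNode("blog post citing the wrong year", is_distractor=True),
    InfoNode("official page with the correct year"),
])
print(constrained_walk(tree))
```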
https://arxiv.org/abs/2601.08699
With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art approaches advocate enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the gradients induced by fully orthogonal constraints strongly push semantically related classes apart, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
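As a rough illustration of the idea (not the paper's exact formulation), the sketch below applies a Huber (smooth-L1) penalty to the off-diagonal cosine similarities between class prompt prototypes, so related classes are separated smoothly rather than forced to full orthogonality; the threshold and weighting are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def soc_regularizer(prototypes: torch.Tensor, delta: float = 0.3) -> torch.Tensor:
    """Huber-style separation penalty over class prompt prototypes of shape (K, d)."""
    z = F.normalize(prototypes, dim=-1)                        # unit-norm prototypes
    sim = z @ z.t()                                            # (K, K) cosine similarities
    off = sim[~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)]
    # quadratic for small similarities, linear beyond delta (Huber / smooth-L1)
    penalty = torch.where(off.abs() <= delta,
                          0.5 * off.pow(2),
                          delta * (off.abs() - 0.5 * delta))
    return penalty.mean()

# Hypothetical use during test-time prompt tuning:
#   loss = entropy_minimization + lam * soc_regularizer(text_prototypes)
```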
https://arxiv.org/abs/2601.08617
While autonomous software engineering (SWE) agents are reshaping programming paradigms, they currently suffer from a "closed-world" limitation: they attempt to fix bugs from scratch or solely using local context, ignoring the immense historical human experience available on platforms like GitHub. Accessing this open-world experience is hindered by the unstructured and fragmented nature of real-world issue-tracking data. In this paper, we introduce MemGovern, a framework designed to govern and transform raw GitHub data into actionable experiential memory for agents. MemGovern employs experience governance to convert human experience into agent-friendly experience cards and introduces an agentic experience search strategy that enables logic-driven retrieval of human expertise. By producing 135K governed experience cards, MemGovern achieves a significant performance boost, improving resolution rates on SWE-bench Verified by 4.65%. As a plug-in approach, MemGovern provides a solution for agent-friendly memory infrastructure.
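To make the "experience card" idea concrete, here is a hypothetical sketch: a card distills a resolved issue into a symptom, root cause, and fix pattern, and a naive keyword overlap stands in for the agentic experience search. The schema and retrieval logic are assumptions for illustration, not MemGovern's.

```python
from dataclasses import dataclass

@dataclass
class ExperienceCard:
    symptom: str        # what the failing issue looked like
    root_cause: str     # why it happened
    fix_pattern: str    # reusable repair strategy

CARDS = [
    ExperienceCard("TypeError: 'NoneType' object is not iterable in config parser",
                   "loader returns None for empty files",
                   "guard the call site and fall back to an empty list"),
    ExperienceCard("UnicodeDecodeError when reading logs on Windows",
                   "default locale encoding differs from UTF-8",
                   "open the file with an explicit encoding argument"),
]

def retrieve(query: str, cards=CARDS) -> ExperienceCard:
    """Return the card whose symptom shares the most words with the query."""
    terms = set(query.lower().split())
    return max(cards, key=lambda c: len(terms & set(c.symptom.lower().split())))

print(retrieve("NoneType is not iterable when parsing an empty config file").fix_pattern)
```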
https://arxiv.org/abs/2601.06789
The Internet of underwater things (IoUT) is gathering increasing attention for monitoring sea life and the deep-ocean environment, underwater surveillance, and maintenance of underwater installations. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from the IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To capture the time sensitivity of the collected data and the fairness of service across devices, we adopt the age of information (AoI) and Jain's fairness index. We develop two deep reinforcement learning (DRL) algorithms, offering a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results show that the proposed FDD and TDD solutions significantly reduce the average AoI and boost the harvested energy as well as data collection fairness compared to baseline approaches.
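Both evaluation quantities mentioned above have standard definitions; the sketch below computes Jain's fairness index and a discrete-time approximation of a device's average age of information. The data layout (delivery/generation time pairs) is an assumed convention rather than the paper's.

```python
import numpy as np

def jain_fairness(x) -> float:
    """Jain's fairness index over per-device quantities (e.g., harvested energy or
    collected data): (sum x)^2 / (n * sum x^2), which lies in (0, 1]."""
    x = np.asarray(x, dtype=float)
    return float(x.sum() ** 2 / (len(x) * np.square(x).sum()))

def time_average_aoi(deliveries, horizon: float, dt: float = 0.1) -> float:
    """Discrete-time approximation of one device's average age of information.
    `deliveries` holds (delivery_time, generation_time) pairs sorted by delivery
    time; the age at time t is t minus the generation time of the freshest
    update delivered so far."""
    acc, last_gen, i, t = 0.0, 0.0, 0, 0.0
    while t < horizon:
        while i < len(deliveries) and deliveries[i][0] <= t:
            last_gen = max(last_gen, deliveries[i][1])
            i += 1
        acc += (t - last_gen) * dt
        t += dt
    return acc / horizon

print(jain_fairness([1.0, 1.0, 0.5]))
print(time_average_aoi([(2.0, 1.5), (6.0, 5.0)], horizon=10.0))
```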
https://arxiv.org/abs/2601.08491
Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce \textbf{HazardForge}, a scalable pipeline that combines image editing models with layout decision algorithms and validation modules to generate these scenarios. Using HazardForge, we construct \textbf{MovSafeBench}, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
https://arxiv.org/abs/2601.08470
Constructing an accurate simulation model of real-world environments requires reliable estimation of physical parameters such as mass, geometry, friction, and contact surfaces. Traditional real-to-simulation (Real2Sim) pipelines rely on manual measurements or fixed, pre-programmed exploration routines, which limit their adaptability to varying tasks and user intents. This paper presents a Real2Sim framework that autonomously generates and executes Behavior Trees for task-specific physical interactions to acquire only the parameters required for a given simulation objective, without relying on pre-defined task templates or expert-designed exploration routines. Given a high-level user request, an incomplete simulation description, and an RGB observation of the scene, a vision-language model performs multi-modal reasoning to identify relevant objects, infer required physical parameters, and generate a structured Behavior Tree composed of elementary robotic actions. The resulting behavior is executed on a torque-controlled Franka Emika Panda, enabling compliant, contact-rich interactions for parameter estimation. The acquired measurements are used to automatically construct a physics-aware simulation. Experimental results on the real manipulator demonstrate estimation of object mass, surface height, and friction-related quantities across multiple scenarios, including occluded objects and incomplete prior models. The proposed approach enables interpretable, intent-driven, and autonomous Real2Sim pipelines, bridging high-level reasoning with physically-grounded robotic interaction.
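The sketch below shows the kind of structured Behavior Tree a vision-language model could emit from elementary actions: a simple Sequence node ticking three placeholder actions for a mass-estimation interaction. The node types, action names, and tick semantics are illustrative assumptions rather than the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

class Status:
    SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

@dataclass
class Action:
    name: str
    fn: Callable[[], str]                      # returns a Status value
    def tick(self) -> str:
        return self.fn()

@dataclass
class Sequence:
    children: List[object] = field(default_factory=list)
    def tick(self) -> str:
        for child in self.children:            # stop at the first non-success child
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

# A tree a VLM might emit to estimate an object's mass: approach, grasp, lift and weigh.
tree = Sequence([
    Action("move_to_object", lambda: Status.SUCCESS),
    Action("grasp",          lambda: Status.SUCCESS),
    Action("lift_and_weigh", lambda: Status.SUCCESS),
])
print(tree.tick())
```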
https://arxiv.org/abs/2601.08454
The advent of Large Multimodal Models (LMMs) offers a promising technology to tackle the limitations of modular design in autonomous driving, which often falters in open-world scenarios requiring sustained environmental understanding and logical reasoning. Besides, embodied artificial intelligence facilitates policy optimization through closed-loop interactions to achieve the continuous learning capability, thereby advancing autonomous driving toward embodied intelligent (EI) driving. However, this capability is constrained when EI driving relies solely on LMMs without joint decision-making. This article introduces a novel semantics and policy dual-driven hybrid decision framework to tackle this challenge, ensuring continuous learning and joint decision-making. The framework merges LMMs for semantic understanding and cognitive representation, and deep reinforcement learning (DRL) for real-time policy optimization. We start by introducing the foundational principles of EI driving and LMMs. Moreover, we examine the emerging opportunities this framework enables, encompassing potential benefits and representative use cases. A case study is conducted experimentally to validate the performance superiority of our framework in completing the lane-change planning task. Finally, several future research directions to empower EI driving are identified to guide subsequent work.
https://arxiv.org/abs/2601.08434
Pet ownership is increasingly common in modern households, yet maintaining a consistent feeding schedule remains challenging for owners, particularly those who live in cities and have busy lifestyles. This paper presents the design, development, and validation of a low-cost, scalable GSM-IoT smart pet feeder that enables remote monitoring and control through cellular communication. The device combines an Arduino microcontroller, a SIM800L GSM module for communication, an ultrasonic sensor for real-time food-level assessment, and a servo mechanism for accurate portion dispensing. A dedicated mobile application was developed using MIT App Inventor, which allows owners to send feeding commands and receive real-time status updates. Experimental results demonstrate a 98\% SMS command success rate, consistent portion dispensing with $\pm 2.67$\% variance, and reliable autonomous operation. Its modular, energy-efficient design makes it easy to use in a wide range of households, including those with limited resources. This work pushes forward the field of accessible pet care technology by providing a practical, scalable, and completely internet-independent solution for personalized pet feeding. In doing so, it sets a new benchmark for low-cost, GSM-powered automation in smart pet products.
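As a small host-side illustration of the food-level assessment (the actual firmware runs on the Arduino), the function below converts an ultrasonic distance reading from a top-mounted sensor into a fill percentage; the hopper dimensions are made-up values, not the paper's.

```python
def food_level_percent(distance_cm: float, empty_cm: float = 25.0, full_cm: float = 3.0) -> float:
    """Map the echo distance (sensor at the top of the hopper) to a 0-100% fill level."""
    level = (empty_cm - distance_cm) / (empty_cm - full_cm)
    return max(0.0, min(1.0, level)) * 100.0

print(food_level_percent(14.0))   # -> 50.0 with the assumed hopper geometry
```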
https://arxiv.org/abs/2601.08394
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
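The general shape of such language-level metrics can be sketched with set arithmetic over object mentions, as below; the third metric, safety misinterpretation, would additionally compare safety judgments and is omitted here. The exact definitions, matching procedure, and safety taxonomy in the paper will differ, so treat this only as an illustration.

```python
def misalignment_metrics(predicted_mentions, ground_truth, safety_critical):
    """Illustrative language-level misalignment scores over object sets.

    `predicted_mentions`: objects the VLM names in its description;
    `ground_truth`: objects actually present in the scene;
    `safety_critical`: subset of ground truth that matters for safety.
    """
    pred, gt, crit = set(predicted_mentions), set(ground_truth), set(safety_critical)
    hallucination = len(pred - gt) / max(len(pred), 1)          # mentioned but absent
    critical_omission = len(crit - pred) / max(len(crit), 1)    # present but unmentioned
    return {"hallucination": hallucination, "critical_omission": critical_omission}

print(misalignment_metrics(
    predicted_mentions=["car", "truck"],
    ground_truth=["car", "pedestrian"],
    safety_critical=["pedestrian"],
))
```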
https://arxiv.org/abs/2601.08355
Equipping agents with memory is essential for solving real-world long-horizon problems. However, most existing agent memory mechanisms rely on static and hand-crafted workflows. This limits the performance and generalization ability of these memory designs, which highlights the need for a more flexible, learning-based memory framework. In this paper, we propose AtomMem, which reframes memory management as a dynamic decision-making problem. We deconstruct high-level memory processes into fundamental atomic CRUD (Create, Read, Update, Delete) operations, transforming the memory workflow into a learnable decision process. By combining supervised fine-tuning with reinforcement learning, AtomMem learns an autonomous, task-aligned policy to orchestrate memory behaviors tailored to specific task demands. Experimental results across 3 long-context benchmarks demonstrate that the trained AtomMem-8B consistently outperforms prior static-workflow memory methods. Further analysis of training dynamics shows that our learning-based formulation enables the agent to discover structured, task-aligned memory management strategies, highlighting a key advantage over predefined routines.
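A minimal sketch of the atomic action space follows: the memory exposes create/read/update/delete operations that a learned policy could select at each step. The key-value store and action interface are assumptions for illustration, not AtomMem's actual design.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    key: str
    value: str

class AtomicMemory:
    """Memory whose management is reduced to four atomic CRUD operations."""
    def __init__(self):
        self._store = {}

    def create(self, key: str, value: str) -> None:
        self._store[key] = MemoryEntry(key, value)

    def read(self, key: str):
        entry = self._store.get(key)
        return entry.value if entry else None

    def update(self, key: str, value: str) -> None:
        if key in self._store:
            self._store[key].value = value

    def delete(self, key: str) -> None:
        self._store.pop(key, None)

# A learned policy would emit (operation, key, value) decisions; here we apply one by hand.
mem = AtomicMemory()
mem.create("user_goal", "book a flight to Tokyo")
print(mem.read("user_goal"))
```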
https://arxiv.org/abs/2601.08323
The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.
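A toy, rule-driven user simulator conveys the shift described above: the simulated user releases the task incrementally and reacts turn by turn, which keeps dialogues going beyond a single task-solving exchange. The rules and phrasing below are illustrative assumptions, not the paper's simulator.

```python
class UserSimulator:
    """Minimal user simulator with two behavioral rules: incremental request-making
    and turn-by-turn feedback to agent questions."""
    def __init__(self, subrequests):
        self.pending = list(subrequests)

    def next_turn(self, agent_reply: str | None = None) -> str:
        if agent_reply is not None and "?" in agent_reply:
            return "Here is the detail you asked for."        # feedback rule
        if self.pending:
            return self.pending.pop(0)                        # incremental requests
        return "That is all, thanks."                          # end the dialogue

user = UserSimulator([
    "Find flights from Berlin to Oslo next Friday.",
    "Now also book a hotel near the airport.",
])
print(user.next_turn())
print(user.next_turn("Do you prefer a morning or evening flight?"))
```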
https://arxiv.org/abs/2601.08225
The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
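Two common LTL patterns that such oracles often encode can be checked over a finite tool-call trace as below; the predicates and the example regulation (an amount cap and a report-after-transfer obligation) are invented for illustration and are not drawn from LogiSafetyBench.

```python
def globally(trace, pred):
    """G p: the predicate must hold at every step of the trace."""
    return all(pred(step) for step in trace)

def response(trace, trigger, obligation):
    """G (p -> F q): every trigger must eventually be followed by the obligation."""
    for i, step in enumerate(trace):
        if trigger(step) and not any(obligation(s) for s in trace[i:]):
            return False
    return True

trace = [
    {"tool": "transfer_funds", "amount": 9000},
    {"tool": "file_compliance_report"},
]
print(globally(trace, lambda s: s.get("amount", 0) <= 10_000))
print(response(trace,
               trigger=lambda s: s["tool"] == "transfer_funds",
               obligation=lambda s: s["tool"] == "file_compliance_report"))
```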
https://arxiv.org/abs/2601.08196
Autonomous experimentation holds the potential to accelerate materials development by combining artificial intelligence (AI) with modular robotic platforms to explore extensive combinatorial chemical and processing spaces. Such self-driving laboratories can not only increase the throughput of repetitive experiments, but also incorporate human domain expertise to drive the search towards user-defined objectives, including improved materials performance metrics. We present an autonomous materials synthesis extension to SARA, the Scientific Autonomous Reasoning Agent, utilizing phase information provided by an automated probabilistic phase labeling algorithm to expedite the search for targeted phase regions. By incorporating human input into an expanded SARA-H (SARA with human-in-the-loop) framework, we enhance the efficiency of the underlying reasoning process. Using synthetic benchmarks, we demonstrate the efficiency of our AI implementation and show that the human input can contribute to significant improvement in sampling efficiency. We conduct experimental active learning campaigns using robotic processing of thin-film samples of several oxide material systems, including Bi$_2$O$_3$, SnO$_x$, and Bi-Ti-O, using lateral-gradient laser spike annealing to synthesize and kinetically trap metastable phases. We showcase the utility of human-in-the-loop autonomous experimentation for the Bi-Ti-O system, where we identify extensive processing domains that stabilize $\delta$-Bi$_2$O$_3$ and Bi$_2$Ti$_2$O$_7$, explore dwell-dependent ternary oxide phase behavior, and provide evidence confirming predictions that cationic substitutional doping of TiO$_2$ with Bi inhibits the unfavorable transformation of the metastable anatase to the ground-state rutile phase. The autonomous methods we have developed enable the discovery of new materials and new understanding of materials synthesis and properties.
https://arxiv.org/abs/2601.08185
This paper introduces Project Synapse, a novel agentic framework designed for the autonomous resolution of last-mile delivery disruptions. Synapse employs a hierarchical multi-agent architecture in which a central Resolution Supervisor agent performs strategic task decomposition and delegates subtasks to specialized worker agents responsible for tactical execution. The system is orchestrated using LangGraph to manage complex and cyclical workflows. To validate the framework, a benchmark dataset of 30 complex disruption scenarios was curated from a qualitative analysis of over 6,000 real-world user reviews. System performance is evaluated using an LLM-as-a-Judge protocol with explicit bias mitigation.
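The hierarchical pattern can be conveyed without the orchestration framework: a supervisor decomposes a disruption into subtasks and delegates them to named workers. This plain-Python sketch is not the LangGraph implementation, and the worker roles are assumptions.

```python
# Illustrative worker agents keyed by specialty; each stands in for an LLM-backed agent.
WORKERS = {
    "traffic":  lambda task: f"rerouted driver around {task['issue']}",
    "customer": lambda task: f"notified customer that delivery is delayed by {task['issue']}",
}

def supervisor(disruption):
    """Strategic decomposition: map one disruption to an ordered list of delegated subtasks."""
    plan = [("traffic", disruption), ("customer", disruption)]
    return [WORKERS[worker](task) for worker, task in plan]

print(supervisor({"issue": "a road closure near the delivery address"}))
```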
https://arxiv.org/abs/2601.08156
We propose a deterministic and time-efficient contact-aware path planner for neurovascular navigation. The algorithm leverages information from pre- and intra-operative images of the vessels to navigate pre-bent passive tools, by intelligently predicting and exploiting interactions with the anatomy. A kinematic model is derived and employed by the sampling-based planner, whose tree expansion utilizes simplified motion primitives. This approach enables fast computation of the feasible path, with negligible loss in accuracy, as demonstrated in diverse and representative vessel anatomies. In these anatomical demonstrators, the algorithm shows a 100% convergence rate within 22.8s in the worst case, with sub-millimeter tracking errors (less than 0.64 mm), and is found effective on anatomical phantoms representative of around 94% of patients.
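A stripped-down sketch of sampling-based tree expansion with simplified motion primitives is shown below: each expansion advances a randomly chosen node by one of a few fixed (advance, heading-change) moves until the goal region is reached. The primitive set, the 2D kinematics, and the absence of vessel-contact checks are all simplifying assumptions, not the paper's planner.

```python
import math
import random

PRIMITIVES = [(1.0, 0.0), (1.0, 0.3), (1.0, -0.3)]   # (advance [a.u.], heading change [rad])

def expand(tree, goal, steps=3000, tol=0.5):
    """Grow a tree of (x, y, heading) states by applying random motion primitives
    to randomly chosen nodes; return the first state that reaches the goal region."""
    for _ in range(steps):
        x, y, th = random.choice(list(tree))
        d, dth = random.choice(PRIMITIVES)
        th2 = th + dth
        child = (x + d * math.cos(th2), y + d * math.sin(th2), th2)
        tree[child] = (x, y, th)                      # store parent for path recovery
        if math.dist(child[:2], goal) < tol:
            return child
    return None

tree = {(0.0, 0.0, 0.0): None}
print(expand(tree, goal=(3.0, 0.5)))
```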
https://arxiv.org/abs/2601.07945
LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at this https URL.
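Point (i), aligning intermediate generative features with self-supervised 3D features, is commonly implemented as a cosine-similarity objective through a small projection head; the sketch below assumes that form and uses made-up feature widths, so it may differ from R3DPA's actual loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(gen_feats: torch.Tensor, ssl_feats: torch.Tensor,
                   proj: torch.nn.Module) -> torch.Tensor:
    """Align intermediate generative features with frozen self-supervised features
    via cosine similarity after a learned projection (a REPA-style objective)."""
    z = F.normalize(proj(gen_feats), dim=-1)       # project into the SSL feature space
    t = F.normalize(ssl_feats.detach(), dim=-1)    # SSL targets are not trained
    return 1.0 - (z * t).sum(dim=-1).mean()

# Toy shapes: 1024 points, generator width 256 projected to a 384-dim SSL space
# (widths are illustrative). The term would be added to the generative training loss.
proj = torch.nn.Linear(256, 384)
loss = alignment_loss(torch.randn(1024, 256), torch.randn(1024, 384), proj)
print(loss.item())
```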
https://arxiv.org/abs/2601.07692
Autonomous 3D scanning of open-world target structures via drones remains challenging despite broad applications. Existing paradigms rely on restrictive assumptions or effortful human priors, limiting practicality, efficiency, and adaptability. Recent foundation models (FMs) offer great potential to bridge this gap. This paper investigates a critical research problem: What system architecture can effectively integrate FM knowledge for this task? We answer it with FlyCo, a principled FM-empowered perception-prediction-planning loop enabling fully autonomous, prompt-driven 3D target scanning in diverse unknown open-world environments. FlyCo directly translates low-effort human prompts (text, visual annotations) into precise adaptive scanning flights via three coordinated stages: (1) perception fuses streaming sensor data with vision-language FMs for robust target grounding and tracking; (2) prediction distills FM knowledge and combines multi-modal cues to infer the partially observed target's complete geometry; (3) planning leverages predictive foresight to generate efficient and safe paths with comprehensive target coverage. Building on this, we further design key components to boost open-world target grounding efficiency and robustness, enhance prediction quality in terms of shape accuracy, zero-shot generalization, and temporal stability, and balance long-horizon flight efficiency with real-time computability and online collision avoidance. Extensive challenging real-world and simulation experiments show FlyCo delivers precise scene understanding, high efficiency, and real-time safety, outperforming existing paradigms with lower human effort and verifying the proposed architecture's practicality. Comprehensive ablations validate each component's contribution. FlyCo also serves as a flexible, extensible blueprint, readily leveraging future FM and robotics advances. Code will be released.
https://arxiv.org/abs/2601.07558
Autonomous driving systems rely heavily on multi-view images to ensure accurate perception and robust decision-making. To effectively develop and evaluate perception stacks and planning algorithms, realistic closed-loop simulators are indispensable. While 3D reconstruction techniques such as Gaussian Splatting offer promising avenues for simulator construction, the rendered novel views often exhibit artifacts, particularly in extrapolated perspectives or when available observations are sparse. We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models, designed to elevate photorealism and multi-view coherence in driving scenes. Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views. This enables the model to infer missing details, suppress rendering artifacts, and enforce cross-view consistency. Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups. Experiments on real-world driving datasets demonstrate substantial improvements in image quality metrics, effectively reducing artifacts while preserving geometric fidelity.
https://arxiv.org/abs/2601.07540