Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
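To make the distillation idea concrete, here is a minimal PyTorch sketch of the core training signal: the vision planner's pooled scene feature is projected into a frozen multimodal LLM's feature space and pulled toward it, alongside the usual trajectory loss. All module names, dimensions, and the choice of a cosine alignment term are illustrative assumptions, not DiMA's actual surrogate tasks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledScenePlanner(nn.Module):
    """Vision planner whose pooled scene feature is aligned to a frozen
    multimodal-LLM teacher during training; the teacher is dropped at inference."""

    def __init__(self, feat_dim=256, llm_dim=4096, horizon=6):
        super().__init__()
        self.scene_encoder = nn.Sequential(           # stand-in for a BEV backbone
            nn.Linear(512, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        self.proj_to_llm = nn.Linear(feat_dim, llm_dim)    # surrogate-task head
        self.traj_head = nn.Linear(feat_dim, horizon * 2)  # (x, y) waypoints

    def forward(self, scene_tokens):
        z = self.scene_encoder(scene_tokens).mean(dim=1)   # pool scene tokens
        return z, self.traj_head(z)

def dima_style_loss(z, traj_pred, traj_gt, llm_feat, proj, alpha=0.5):
    """Planning L2 loss plus cosine alignment to precomputed teacher features."""
    plan_loss = F.mse_loss(traj_pred, traj_gt)
    align_loss = 1 - F.cosine_similarity(proj(z), llm_feat, dim=-1).mean()
    return plan_loss + alpha * align_loss

model = DistilledScenePlanner()
scene = torch.randn(4, 100, 512)     # batch of 4 scenes, 100 tokens each
llm_feat = torch.randn(4, 4096)      # frozen multimodal-LLM features (offline)
traj_gt = torch.randn(4, 12)
z, traj = model(scene)
loss = dima_style_loss(z, traj, traj_gt, llm_feat, model.proj_to_llm)
loss.backward()
```

Because the alignment term only shapes the shared scene encoder, the projection head and the teacher can both be discarded at inference, which is what keeps the deployed planner LLM-free.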
https://arxiv.org/abs/2501.09757
The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to identify whom AI agents act on behalf of and to guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, in which human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. The framework builds on existing identity and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata while maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural-language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns: it works toward ensuring that agentic AI systems perform only appropriate actions, and it gives digital service providers a tool for enabling AI agent interactions without risking harm from scalable interaction.
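As an illustration of the kind of token such a framework might issue, the sketch below builds an OAuth 2.0/OIDC-style claim set extended with agent-delegation metadata. The standard claims (iss, sub, aud, exp) and the RFC 8693 actor claim are real; the delegation block and its field names are hypothetical, and a real deployment would sign the payload as a JWT.

```python
import json, time, uuid

def build_agent_delegation_token(user_id, agent_id, scopes, max_depth=1, ttl_s=3600):
    """OAuth2/OIDC-style claims extended with hypothetical agent-delegation
    fields. In practice this payload would be signed (e.g. as a JWT)."""
    now = int(time.time())
    return {
        "iss": "https://idp.example.com",    # standard OIDC issuer
        "sub": user_id,                      # the delegating human user
        "aud": "https://api.example.com",
        "iat": now,
        "exp": now + ttl_s,                  # short-lived by default
        "jti": str(uuid.uuid4()),
        "act": {"sub": agent_id, "type": "ai_agent"},  # actor claim (RFC 8693 style)
        "scope": " ".join(scopes),           # permissions granted to the agent
        # illustrative extension block, not part of any standard:
        "delegation": {
            "on_behalf_of": user_id,
            "max_redelegation_depth": max_depth,  # bound further delegation
            "audit_log": "https://audit.example.com/v1/events",
        },
    }

token = build_agent_delegation_token(
    "user-42", "agent-7", ["calendar:read", "email:draft"])
print(json.dumps(token, indent=2))
```

The key property the paper argues for is visible in the claim structure: the human principal (`sub`) and the acting agent (`act.sub`) are kept distinct, so a relying service can both scope what the agent may do and attribute every action back to the delegating user.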
https://arxiv.org/abs/2501.09674
Autonomous docking remains one of the most challenging maneuvers in marine robotics, requiring precise control and robust perception in confined spaces. This paper presents a novel approach integrating Model Predictive Path Integral (MPPI) control with real-time LiDAR-based dock detection for autonomous surface vessel docking. Our framework uniquely combines probabilistic trajectory optimization with a multi-objective cost function that simultaneously considers docking precision, safety constraints, and motion efficiency. The MPPI controller generates optimal trajectories by intelligently sampling control sequences and evaluating their costs based on dynamic clearance requirements, orientation alignment, and target position objectives. We introduce an adaptive dock detection pipeline that processes LiDAR point clouds to extract critical geometric features, enabling real-time updates of docking parameters. The proposed method is extensively validated in a physics-based simulation environment that incorporates realistic sensor noise, vessel dynamics, and environmental constraints. Results demonstrate successful docking from various initial positions while maintaining safe clearances and smooth motion characteristics.
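For readers unfamiliar with MPPI, the following NumPy sketch shows the basic update this controller family relies on: sample perturbed control sequences, roll them out through the dynamics, and average the perturbations with exponential weights on trajectory cost. The toy single-integrator vessel and quadratic docking cost are stand-ins, not the paper's multi-objective cost.

```python
import numpy as np

def mppi_step(x0, dynamics, cost, u_nom, n_samples=256, sigma=0.3, lam=1.0):
    """One MPPI update: sample perturbed control sequences, roll out the
    dynamics, and exponentially weight the perturbations by trajectory cost."""
    H, m = u_nom.shape
    noise = sigma * np.random.randn(n_samples, H, m)
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = x0.copy()
        for t in range(H):
            x = dynamics(x, u_nom[t] + noise[k, t])
            costs[k] += cost(x, u_nom[t] + noise[k, t])
    w = np.exp(-(costs - costs.min()) / lam)     # softmax-style weights
    w /= w.sum()
    return u_nom + np.einsum("k,khm->hm", w, noise)

# toy docking example: 2D point "vessel" driving to a dock at (5, 0)
dock = np.array([5.0, 0.0])
dyn = lambda x, u: x + 0.1 * u                               # single integrator
cst = lambda x, u: np.sum((x - dock) ** 2) + 0.01 * np.sum(u ** 2)

x = np.zeros(2)
u_plan = np.zeros((20, 2))
for _ in range(50):
    u_plan = mppi_step(x, dyn, cst, u_plan)
    x = dyn(x, u_plan[0])                        # apply first control, replan
    u_plan = np.roll(u_plan, -1, axis=0)         # warm-start the shifted plan
print("final position:", x.round(2))
```

In the paper's setting the scalar cost would be replaced by the multi-objective term (clearance, orientation alignment, target position), but the sampling-and-reweighting structure stays the same.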
https://arxiv.org/abs/2501.09668
Planning for autonomous systems typically requires reasoning with models at different levels of abstraction, and the harmonization of two competing sets of objectives: high-level mission goals that refer to an interaction of the system with the external environment, and low-level platform constraints that aim to preserve the integrity and the correct interaction of the subsystems. The complicated interplay between these two models makes it very hard to reason about the system as a whole, especially when the objective is to find plans with robustness guarantees, considering the non-deterministic behavior of the lower layers of the system. In this paper, we introduce the problem of Platform-Aware Mission Planning (PAMP), addressing it in the setting of temporal durative actions. The PAMP problem differs from standard temporal planning in its exists-forall nature: the high-level plan dealing with mission goals is required to satisfy safety and executability constraints for all possible non-deterministic executions of the low-level model of the platform and the environment. We propose two approaches for solving PAMP. The first, baseline approach amalgamates the mission and platform levels, while the second is based on an abstraction-refinement loop that leverages the combination of a planner and a verification engine. We prove the soundness and completeness of the proposed approaches and validate them experimentally, demonstrating the importance of heterogeneous modeling and the superiority of the technique based on abstraction-refinement.
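The abstraction-refinement approach can be summarized in a short control-flow sketch: plan against the mission abstraction, verify the plan against the platform model, and, on a counterexample, block it and replan. The planner and verifier are abstract callables here; the toy instance at the bottom exists only to make the loop executable.

```python
def plan_with_refinement(planner, verifier, max_iters=20):
    """Abstraction-refinement loop for platform-aware mission planning
    (a control-flow sketch; `planner` stands in for a temporal planner and
    `verifier` for a verification engine over the non-deterministic platform)."""
    constraints = []
    for _ in range(max_iters):
        plan = planner(constraints)
        if plan is None:
            return None                     # mission goals are unsatisfiable
        counterexample = verifier(plan)
        if counterexample is None:
            return plan                     # safe for every platform execution
        constraints.append(counterexample)  # exclude this failure mode, retry
    raise RuntimeError("no verified plan within the iteration budget")

# toy instance: plans are route names; only route_c survives verification
candidates = ["route_a", "route_b", "route_c"]
planner = lambda blocked: next((p for p in candidates if p not in blocked), None)
verifier = lambda plan: plan if plan != "route_c" else None
print(plan_with_refinement(planner, verifier))  # -> route_c
```

The exists-forall structure is reflected in the split of responsibilities: the planner handles the existential search over mission plans, while the verifier discharges the universal check over all low-level executions.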
https://arxiv.org/abs/2501.09632
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on designing their own losses to effectively distill pre-trained 2D image representations into a 3D model. However, the other parts of the design have been surprisingly unexplored. We find that fundamental design elements, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions, yet they have been overlooked in prior works. In this work, we show that simple fixes to these designs notably outperform existing methods, by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset in downstream task performance. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they produce with the commonly deployed sparse-convolution input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting use to the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
https://arxiv.org/abs/2501.09485
Detecting the three-dimensional position and orientation of objects using a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. In this paper, we present the first method to train 3D object detectors for monocular RGB cameras without domain-specific human annotations, thus making orders of magnitude more data available for training. Thanks to the newly proposed Canonical Object Space, the method can not only exploit data across a variety of datasets and camera setups to train a single 3D detector, but, unlike previous work, it also works out of the box in previously unseen camera setups. All this is crucial for practical applications, where the data and cameras are extremely heterogeneous. The method is evaluated on two standard autonomous driving datasets, where it outperforms previous works, which, unlike our method, still rely on 2D human annotations.
https://arxiv.org/abs/2501.09481
Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security to smart cities. However, achieving real-time object detection on edge devices presents significant challenges due to their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as input down-sampling and network up-scaling, often compromise detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven Partitioning and Edge Offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. Furthermore, a parallel edge offloading scheme is implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly enhances detection accuracy and reduces inference latency, surpassing existing methods.
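To illustrate the partitioning idea (though not the paper's RL-DCA, which learns the clustering), here is a simple stand-in that groups prior-frame detections with k-means and cuts one padded block per cluster, so object-dense regions become separate crops that can be offloaded to different edge servers.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_frame(detections, frame_w, frame_h, n_blocks=4, pad=32):
    """Illustrative non-uniform partitioning (a clustering stand-in for RL-DCA):
    cluster prior-frame object centers and cut one padded block per cluster."""
    centers = np.array([[x + w / 2, y + h / 2] for x, y, w, h in detections])
    k = min(n_blocks, len(centers))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(centers)
    blocks = []
    for c in range(k):
        boxes = np.array([d for d, l in zip(detections, labels) if l == c])
        x0 = max(0, boxes[:, 0].min() - pad)
        y0 = max(0, boxes[:, 1].min() - pad)
        x1 = min(frame_w, (boxes[:, 0] + boxes[:, 2]).max() + pad)
        y1 = min(frame_h, (boxes[:, 1] + boxes[:, 3]).max() + pad)
        blocks.append((int(x0), int(y0), int(x1), int(y1)))
    return blocks  # each block can be offloaded to a different edge server

# three detections (x, y, w, h) in a 1080p frame; two are close together
print(partition_frame([(100, 80, 40, 60), (120, 90, 30, 30), (900, 500, 80, 80)],
                      1920, 1080))
```

The payoff is the same one RE-POSE targets: rather than down-sampling the whole high-resolution frame, only the compact, object-bearing crops are sent through the detector, which is what makes parallel offloading across servers worthwhile.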
https://arxiv.org/abs/2501.09465
Traditional in-person psychological counseling remains primarily niche, often sought only by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems use agents with a fixed structure, which limits their self-optimization capabilities, or provide hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single-response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct AutoCBT, a CBT-oriented autonomous multi-agent framework, and demonstrate its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.
https://arxiv.org/abs/2501.09426
Unmanned aerial vehicles (UAVs) are efficient tools for diverse tasks such as electronic reconnaissance, agricultural operations, and disaster relief. In complex three-dimensional (3D) environments, path planning with obstacle avoidance for UAVs is a significant issue for safety assurance. In this paper, we construct a comprehensive 3D scenario with obstacles and no-fly zones for dynamic UAV trajectories. Moreover, a novel artificial potential field algorithm coupled with simulated annealing (APF-SA) is proposed to tackle the robust path planning problem. APF-SA modifies the attractive and repulsive potential functions and leverages simulated annealing to escape local minima and converge to globally optimal solutions. Simulation results demonstrate the effectiveness of APF-SA, enabling efficient autonomous path planning for UAVs with obstacle avoidance.
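A minimal sketch of the APF-SA idea, assuming standard attractive/repulsive potential forms and a Metropolis acceptance rule: candidate moves that increase the potential are still accepted with Boltzmann probability, which is what lets the path escape local minima. The gains, cooling schedule, and random-walk proposal are illustrative choices, not the paper's modified functions.

```python
import numpy as np

def total_potential(p, goal, obstacles, k_att=1.0, k_rep=100.0, d0=2.0):
    """Attractive well toward the goal plus repulsive barriers near obstacles."""
    u = 0.5 * k_att * np.sum((p - goal) ** 2)
    for obs in obstacles:
        d = np.linalg.norm(p - obs)
        if d < d0:                       # repulsion only inside influence radius
            u += 0.5 * k_rep * (1.0 / max(d, 1e-6) - 1.0 / d0) ** 2
    return u

def apf_sa_step(p, goal, obstacles, step=0.2, T=1.0):
    """Random candidate move with Metropolis acceptance: uphill moves are
    accepted with Boltzmann probability so the path can leave local minima."""
    cand = p + step * np.random.uniform(-1, 1, size=p.shape)
    dU = (total_potential(cand, goal, obstacles)
          - total_potential(p, goal, obstacles))
    if dU < 0 or np.random.rand() < np.exp(-dU / T):
        return cand
    return p

goal = np.array([10.0, 10.0, 5.0])
obstacles = [np.array([5.0, 5.0, 2.5])]          # one mid-route obstacle
p, T = np.zeros(3), 2.0
for _ in range(3000):
    p = apf_sa_step(p, goal, obstacles, T=T)
    T = max(0.01, T * 0.997)                     # geometric cooling schedule
print("final distance to goal:", round(float(np.linalg.norm(p - goal)), 3))
```

With pure gradient descent on the same potential, the repulsive barrier can trap the UAV in a local minimum behind the obstacle; the annealed acceptance rule trades some short-term potential increase for a globally better route.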
https://arxiv.org/abs/2501.09338
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at this https URL.
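A reward-free trajectory embedding of this kind might look like the following sketch: encode each (state, action) pair, mean-pool over time, and train with an InfoNCE objective in which two segments of the same agent's behavior form a positive pair. The architecture and the contrastive pairing scheme are assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Reward-free trajectory embedder: encode each (state, action) pair,
    pool over time, and map trajectories onto the unit sphere so that
    behaviorally similar trajectories land close together."""

    def __init__(self, state_dim, action_dim, embed_dim=128):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim))

    def forward(self, states, actions):             # (B, T, s), (B, T, a)
        h = self.pair_mlp(torch.cat([states, actions], dim=-1))
        z = h.mean(dim=1)                           # temporal mean-pool
        return F.normalize(z, dim=-1)

def info_nce(z_a, z_b, tau=0.1):
    """Contrastive loss; positives are two segments from the same agent."""
    logits = z_a @ z_b.t() / tau
    targets = torch.arange(z_a.size(0))
    return F.cross_entropy(logits, targets)

enc = TrajectoryEncoder(state_dim=17, action_dim=6)
s, a = torch.randn(8, 50, 17), torch.randn(8, 50, 6)
# second "view": a lightly perturbed copy standing in for another segment
loss = info_nce(enc(s, a), enc(s + 0.01 * torch.randn_like(s), a))
loss.backward()
```

Because no reward labels enter the objective, the same encoder can be reused across domains, and the pooled, normalized embedding is the kind of representation the abstract reports as useful for downstream imitation, classification, clustering, and regression.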
https://arxiv.org/abs/2501.09327
As robotic technology rapidly develops, robots are being employed in an increasing number of fields. However, due to the complexity of deployment environments or the prevalence of ambiguous-condition objects, the practical application of robotics still faces many challenges, leading to frequent errors. Traditional methods and some LLM-based approaches, although improved, still require substantial human intervention and struggle with autonomous error correction in complex environments. In this work, we propose RoboReflect, a novel framework leveraging large vision-language models (LVLMs) to enable self-reflection and autonomous error correction in robotic grasping tasks. RoboReflect allows robots to automatically adjust their strategies based on unsuccessful attempts until successful execution is achieved. The corrected strategies are saved in a memory for future task reference. We evaluate RoboReflect through extensive testing on eight common objects prone to ambiguous conditions of three categories. The results demonstrate that RoboReflect not only outperforms existing grasp pose estimation methods like AnyGrasp and high-level action planning techniques using GPT-4V but also significantly enhances the robot's ability to adapt and correct errors independently. These findings underscore the critical importance of autonomous self-reflection in robotic systems while effectively addressing the challenges posed by ambiguous environments.
https://arxiv.org/abs/2501.09307
Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.
https://arxiv.org/abs/2501.09290
This work introduces a novel Retention Layer mechanism for Transformer-based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real-time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context-sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real-time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real-world challenges effectively. By emulating key aspects of human learning, this retention-enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session-aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.
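A minimal sketch of what such a layer could look like, assuming a gated attention read over an external key-value buffer plus a ring-buffer write of pooled session summaries; the pooling, gating, and slot management here are illustrative simplifications of the episodic-buffer design the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetentionLayer(nn.Module):
    """Persistent-memory layer: hidden states attend over an external
    key-value store that survives across sessions and is appended to at
    write time (a simplified episodic-buffer mechanism)."""

    def __init__(self, d_model=512, mem_slots=1024):
        super().__init__()
        self.register_buffer("mem_k", torch.zeros(mem_slots, d_model))
        self.register_buffer("mem_v", torch.zeros(mem_slots, d_model))
        self.register_buffer("write_ptr", torch.zeros(1, dtype=torch.long))
        self.q_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, h):                                # h: (B, T, d)
        q = self.q_proj(h)
        att = F.softmax(q @ self.mem_k.t() / h.size(-1) ** 0.5, dim=-1)
        recalled = att @ self.mem_v                      # (B, T, d)
        g = torch.sigmoid(self.gate(torch.cat([h, recalled], dim=-1)))
        return h + g * recalled                          # gated memory residual

    @torch.no_grad()
    def write(self, h):
        """Store a pooled observation in a ring buffer for later sessions."""
        entry = h.mean(dim=(0, 1))                       # crude episodic summary
        i = int(self.write_ptr) % self.mem_k.size(0)
        self.mem_k[i] = entry
        self.mem_v[i] = entry
        self.write_ptr += 1

layer = RetentionLayer()
x = torch.randn(2, 16, 512)
y = layer(x)          # read: attend over persistent memory
layer.write(x)        # write: retain this session's pattern
```

Registering the memory as buffers (rather than parameters) is the detail that makes it persistent: the store is saved and restored with the model's state dict across sessions, yet it is populated at run time rather than by gradient descent.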
https://arxiv.org/abs/2501.09166
Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multi-step reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multi-agent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses challenges in scaling these systems, ensuring ethical decision making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG.
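The agentic loop the survey describes can be boiled down to a few lines: plan a query, retrieve (tool use), reflect on whether the evidence suffices, and either refine the query or answer. The sketch below is framework-agnostic; the stub retriever and llm callables exist only so the loop runs end-to-end.

```python
def agentic_rag(question, llm, retriever, max_steps=4):
    """Minimal agentic-RAG control loop (a sketch, not a specific framework):
    plan a query, retrieve, reflect on sufficiency, and either refine the
    query (iterative retrieval) or answer from the gathered evidence."""
    evidence = []
    query = question
    for _ in range(max_steps):
        evidence += retriever(query)                      # tool use
        verdict = llm(f"Question: {question}\nEvidence: {evidence}\n"
                      "Is the evidence sufficient? Reply YES or a better query.")
        if verdict.strip().upper().startswith("YES"):     # reflection
            break
        query = verdict                                   # planning: refine query
    return llm(f"Answer using only this evidence.\n"
               f"Question: {question}\nEvidence: {evidence}")

# stub components so the sketch runs end-to-end
fake_docs = {"transformer": "Transformers use self-attention (2017)."}
retriever = lambda q: [v for k, v in fake_docs.items() if k in q.lower()]
llm = lambda prompt: "YES" if "sufficient" in prompt else "Self-attention, 2017."
print(agentic_rag("When were Transformers introduced?", llm, retriever))
```

The contrast with traditional RAG is the loop itself: a static pipeline retrieves once and generates, whereas the agentic variant lets the model decide, per step, whether to retrieve again with a refined query before committing to an answer.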
https://arxiv.org/abs/2501.09136
Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
https://arxiv.org/abs/2501.08946
Despite the recent developments in obstacle avoidance and other safety features, autonomous Unmanned Aerial Vehicles (UAVs) continue to face safety challenges. No previous work has investigated the relationship between the behavioral uncertainty of a UAV and the unsafety of its flight. By quantifying uncertainty, it is possible to develop a predictor for unsafety, which acts as a flight supervisor. We conducted a large-scale empirical investigation of safety violations using PX4-Autopilot, an open-source UAV software platform. Our dataset of over 5,000 simulated flights, created to challenge obstacle avoidance, allowed us to explore the relation between uncertain UAV decisions and safety violations: up to 89% of unsafe UAV states exhibit significant decision uncertainty, and up to 74% of uncertain decisions lead to unsafe states. Based on these findings, we implemented Superialist (Supervising Autonomous Aerial Vehicles), a runtime uncertainty detector based on autoencoders, the state-of-the-art technology for anomaly detection. Superialist achieved high performance in detecting uncertain behaviors, with up to 96% precision and 93% recall. Despite the observed performance degradation when using the same approach for predicting unsafety (up to 74% precision and 87% recall), Superialist enabled early prediction of unsafe states up to 50 seconds in advance.
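The detector's core recipe, autoencoder reconstruction error with a quantile threshold, is standard and easy to sketch. The dimensions, synthetic telemetry, and 99th-percentile threshold below are assumptions; only the mechanism (train on nominal states, flag high reconstruction error at runtime) follows the paper.

```python
import torch
import torch.nn as nn

class FlightAutoencoder(nn.Module):
    """Autoencoder trained on nominal flight states; high reconstruction
    error at runtime flags uncertain behavior (anomaly-detection style)."""

    def __init__(self, state_dim=12, latent=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                                 nn.Linear(32, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                 nn.Linear(32, state_dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def fit_threshold(model, nominal, q=0.99):
    """Set the alarm threshold at a high quantile of nominal errors."""
    with torch.no_grad():
        err = ((model(nominal) - nominal) ** 2).mean(dim=-1)
    return torch.quantile(err, q).item()

model = FlightAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
nominal = torch.randn(4096, 12) * 0.1                  # stand-in telemetry
for _ in range(200):                                   # fit on nominal flights
    opt.zero_grad()
    loss = ((model(nominal) - nominal) ** 2).mean()
    loss.backward()
    opt.step()

tau = fit_threshold(model, nominal)
anomalous = torch.randn(8, 12) * 2.0                   # out-of-distribution states
flags = ((model(anomalous) - anomalous) ** 2).mean(dim=-1) > tau
print("uncertain-behavior flags:", flags.tolist())
```

Because the autoencoder only ever sees nominal behavior during training, it needs no labeled failures, which is what makes it practical as a runtime supervisor that can raise an alarm tens of seconds before an unsafe state materializes.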
https://arxiv.org/abs/2501.08908
Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods still face challenges in visual understanding, decision reasoning, and scene generalization. To solve these issues, we propose GPVL, a generative planning model with 3D vision-language pre-training, for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D vision-language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's-eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performance compared with state-of-the-art methods. Besides, GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. The effective, robust, and efficient performance of GPVL is believed to be crucial for the practical application of future autonomous driving systems. Code is available at this https URL
https://arxiv.org/abs/2501.08861
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 259% in low-light conditions, outperforming existing methods. For widespread use and further development, the research work is fully open-source at this https URL.
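To show what "pose graph optimization over VO and IMU measurements" means at its simplest, the sketch below solves a 1D chain of poses from two sets of noisy relative measurements with scipy least squares, weighting the cleaner IMU edges more heavily. This is a toy illustration of the fusion idea, not BrightVO's Transformer-based refinement module.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_poses(vo_odom, imu_odom, w_vo=1.0, w_imu=2.0):
    """Solve for a pose chain that best agrees with two sets of relative
    odometry edges (VO and IMU); the first pose is anchored at zero."""
    def residuals(x):
        poses = np.concatenate([[0.0], x])
        r = [w_vo * (poses[i + 1] - poses[i] - d) for i, d in enumerate(vo_odom)]
        r += [w_imu * (poses[i + 1] - poses[i] - d) for i, d in enumerate(imu_odom)]
        return np.array(r)
    sol = least_squares(residuals, x0=np.cumsum(vo_odom))  # init from VO alone
    return np.concatenate([[0.0], sol.x])

rng = np.random.default_rng(0)
true_steps = np.ones(10)                           # ground truth: 1 m per frame
vo = true_steps + rng.normal(0, 0.3, 10)           # drifting low-light VO
imu = true_steps + rng.normal(0, 0.05, 10)         # cleaner inertial edges
print(refine_poses(vo, imu).round(2))              # refined pose chain
```

The weighting is the whole point in the low-light setting: when visual features degrade, the VO edges become noisy, and the IMU edges dominate the least-squares solution, pulling the trajectory back toward the true motion.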
https://arxiv.org/abs/2501.08659
Autonomous unmanned aerial vehicles (UAVs) integrated with edge computing capabilities empower real-time data processing directly on the device, dramatically reducing latency in critical scenarios such as wildfire detection. This study underscores the significance of Transfer Learning (TL) in boosting the performance of object detectors for identifying wildfire smoke and flames, especially when trained on limited datasets, and investigates the impact TL has on edge computing metrics, focusing on how TL-enhanced You Only Look Once (YOLO) models perform in terms of inference time, power usage, and energy consumption on edge computing devices. This study utilizes the Aerial Fire and Smoke Essential (AFSE) dataset as the target, with the Flame and Smoke Detection Dataset (FASDD) and the Microsoft Common Objects in Context (COCO) dataset serving as source datasets. We explore a two-stage cascaded TL method, utilizing D-Fire or FASDD as initial-stage target datasets and AFSE as the subsequent stage. Through fine-tuning, TL significantly enhances detection precision, achieving up to 79.2% mean Average Precision (mAP@0.5), reduces training time, and increases model generalizability across the AFSE dataset. However, cascaded TL yielded no notable improvements, and TL alone did not benefit the edge computing metrics evaluated. Lastly, this work found that YOLOv5n remains a powerful model when hardware acceleration is lacking, processing images nearly twice as fast as its newer counterpart, YOLO11n. Overall, the results affirm TL's role in augmenting the accuracy of object detectors while also illustrating that additional enhancements are needed to improve edge computing performance.
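A hedged sketch of the two-stage cascaded recipe, assuming the ultralytics Python API's standard train/val interface; the dataset YAML paths are placeholders for FASDD/D-Fire and AFSE configurations, and the epoch counts are illustrative.

```python
# Placeholders: fasdd.yaml / afse.yaml are assumed dataset configs in
# ultralytics format; yolo11n.pt is one of the models the study compares.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                    # stage 0: COCO-pretrained weights

# Stage 1: fine-tune on the large source fire/smoke dataset (FASDD or D-Fire)
model.train(data="fasdd.yaml", epochs=50, imgsz=640)

# Stage 2: fine-tune again on the small aerial target dataset (AFSE)
model.train(data="afse.yaml", epochs=30, imgsz=640)

metrics = model.val(data="afse.yaml")         # evaluate on the AFSE val split
print("mAP@0.5:", metrics.box.map50)
```

The cascade simply chains two fine-tuning runs so the detector sees a large, related source distribution before the scarce aerial target data; the study's finding is that, in their setting, this second hop added little over direct COCO-to-AFSE transfer.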
https://arxiv.org/abs/2501.08639
Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce Yuan, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. Yuan uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention, a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, Yuan demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore Yuan's potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.
https://arxiv.org/abs/2501.08505