Navigating complex environments requires Unmanned Aerial Vehicles (UAVs) and autonomous systems to perform trajectory tracking and obstacle avoidance in real time. While many control strategies have effectively utilized linear approximations, addressing the non-linear dynamics of UAVs, especially in obstacle-dense environments, remains a key challenge that requires further research. This paper introduces a Non-linear Model Predictive Control (NMPC) framework for the DJI Matrice 100 that addresses these challenges by using a dynamic model and B-spline interpolation for smooth reference trajectories, ensuring minimal deviation while respecting safety constraints. The framework supports various trajectory types and employs a penalty-based cost function for control accuracy in tight maneuvers. It utilizes CasADi for efficient real-time optimization, enabling the UAV to maintain robust operation even under tight computational constraints. Simulation and real-world indoor and outdoor experiments demonstrated the NMPC framework's ability to adapt to disturbances, resulting in smooth, collision-free navigation.
https://arxiv.org/abs/2410.02732
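The smooth reference trajectory mentioned in the abstract above can be illustrated with a uniform cubic B-spline. This is a minimal sketch, not the paper's implementation: the function names are hypothetical, and the waypoints are treated as control points, so the curve passes near them rather than exactly through them.

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1]."""
    # Standard uniform cubic B-spline basis functions; they sum to 1 for any u.
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    b3 = u**3 / 6.0
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_reference(waypoints, samples_per_segment=10):
    """Sample a smooth reference path from a list of control points.

    Assumption of this sketch: waypoints act as B-spline control points.
    """
    reference = []
    for i in range(len(waypoints) - 3):
        for k in range(samples_per_segment):
            u = k / samples_per_segment
            reference.append(cubic_bspline_point(*waypoints[i:i + 4], u))
    return reference


if __name__ == "__main__":
    path = sample_reference([(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)])
    print(len(path), path[0])
```

A tracking controller would then penalize the distance between the predicted state and these sampled reference points over its horizon.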
Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 23 different types of attack/defense methods, and 8 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, a mixed attack, and 10 corresponding defenses across 13 LLM backbones with nearly 90,000 testing cases in total. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30%, but limited effectiveness shown in current defenses, revealing important work still to be done on agent security for the community. Our code can be found at this https URL.
https://arxiv.org/abs/2410.02644
Accurate online multiple-camera vehicle tracking is essential for intelligent transportation systems, autonomous driving, and smart city applications. Like single-camera multiple-object tracking, it is commonly formulated as a graph problem of tracking-by-detection. Within this framework, existing online methods usually consist of two-stage procedures that cluster temporally first, then spatially, or vice versa. This is computationally expensive and prone to error accumulation. We introduce a graph representation that allows spatial-temporal clustering in a single, combined step: new detections are spatially and temporally connected with existing clusters. By keeping sparse appearance and positional cues of all detections in a cluster, our method can compare clusters based on the strongest available evidence. The final tracks are obtained online using a simple multicut assignment procedure. Our method does not require any training on the target scene, pre-extraction of single-camera tracks, or additional annotations. Notably, we outperform the online state-of-the-art in terms of IDF1 by more than 14% on the CityFlow dataset and by more than 25% on the Synthehicle dataset. The code is publicly available.
https://arxiv.org/abs/2410.02638
The rapid advancements in autonomous vehicle software present both opportunities and challenges, especially in enhancing road safety. The primary objective of autonomous vehicles is to reduce accident rates through improved safety measures. However, the integration of new algorithms into the autonomous vehicle, such as Artificial Intelligence methods, raises concerns about compliance with established safety regulations. This paper introduces a novel software architecture based on behavior trees, aligned with established standards and designed to supervise vehicle functional safety in real time. It specifically addresses the integration of algorithms into industrial road vehicles, adhering to ISO 26262. The proposed supervision methodology involves the detection of hazards and compliance with functional and technical safety requirements when a hazard arises. This methodology, implemented in this study in a Renault Mégane (currently at SAE level 3 of automation), not only guarantees compliance with safety standards, but also paves the way for safer and more reliable autonomous driving technologies.
https://arxiv.org/abs/2410.02469
The coastal underwater evidence search system with surface-underwater collaboration is designed to revolutionize the search for artificial objects in coastal underwater environments, overcoming limitations associated with traditional methods such as divers and tethered remotely operated vehicles. Our innovative multi-robot collaborative system consists of three parts: an autonomous surface vehicle as a mission control center, a towed underwater vehicle for wide-area search, and a biomimetic underwater robot inspired by marine organisms for detailed inspections of identified areas. We conduct extensive simulations and real-world experiments in pond environments and coastal fields to demonstrate the system's potential to surpass the limitations of conventional underwater search methods, offering a robust and efficient solution for law enforcement and recovery operations in marine settings.
https://arxiv.org/abs/2410.02345
Establishing and maintaining 5G mmWave vehicular connectivity poses a significant challenge due to high user mobility that necessitates frequent triggering of beam switching procedures. Departing from reactive beam switching based on the user device channel state feedback, proactive beam switching prepares in advance for upcoming beam switching decisions by exploiting accurate channel state information (CSI) prediction. In this paper, we develop a framework for autonomous self-trained CSI prediction for mmWave vehicular users where a base station (gNB) collects and labels a dataset that it uses for training a recurrent neural network (RNN)-based CSI prediction model. The proposed framework exploits the CSI feedback from vehicular users combined with overhearing the C-V2X cooperative awareness messages (CAMs) they broadcast. We implement and evaluate the proposed framework using the DeepMIMO dataset generation environment and demonstrate its capability to provide accurate CSI prediction for 5G mmWave vehicular users. The CSI prediction model is trained, and its capability to provide accurate CSI predictions from various input features is investigated.
https://arxiv.org/abs/2410.02326
Dynamic and interactive traffic scenarios pose significant challenges for autonomous driving systems. Reinforcement learning (RL) offers a promising approach by enabling the exploration of driving policies beyond the constraints of pre-collected datasets and predefined conditions, particularly in complex environments. However, a critical challenge lies in effectively extracting spatial and temporal features from sequences of high-dimensional, multi-modal observations while minimizing the accumulation of errors over time. Additionally, efficiently guiding large-scale RL models to converge on optimal driving policies without frequent failures during the training process remains tricky. We propose an end-to-end model-based RL algorithm named Ramble to address these issues. Ramble processes multi-view RGB images and LiDAR point clouds into low-dimensional latent features to capture the context of traffic scenarios at each time step. A transformer-based architecture is then employed to model temporal dependencies and predict future states. By learning a dynamics model of the environment, Ramble can foresee upcoming traffic events and make more informed, strategic decisions. Our implementation demonstrates that prior experience in feature extraction and decision-making plays a pivotal role in accelerating the convergence of RL models toward optimal driving policies. Ramble achieves state-of-the-art performance regarding route completion rate and driving score on the CARLA Leaderboard 2.0, showcasing its effectiveness in managing complex and dynamic traffic situations.
https://arxiv.org/abs/2410.02253
Trajectory prediction is a pivotal component of autonomous driving systems, enabling the application of accumulated movement experience to current scenarios. Although most existing methods concentrate on learning continuous representations to gain valuable experience, they often suffer from computational inefficiencies and struggle with unfamiliar situations. To address this issue, we propose the Fragmented-Memory-based Trajectory Prediction (FMTP) model, inspired by the remarkable learning capabilities of humans, particularly their ability to leverage accumulated experience and recall relevant memories in unfamiliar situations. The FMTP model employs discrete representations to enhance computational efficiency by reducing information redundancy while maintaining the flexibility to utilize past experiences. Specifically, we design a learnable memory array by consolidating continuous trajectory representations from the training set using defined quantization operations during the training phase. This approach further eliminates redundant information while preserving essential features in discrete form. Additionally, we develop an advanced reasoning engine based on language models to deeply learn the associative rules among these discrete representations. Our method has been evaluated on various public datasets, including ETH-UCY, inD, SDD, nuScenes, Waymo, and VTL-TP. The extensive experimental results demonstrate that our approach achieves significant performance and extracts more valuable experience from past trajectories to inform the current state.
https://arxiv.org/abs/2410.02201
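The quantization step described in the FMTP abstract, mapping a continuous trajectory representation to a discrete memory entry, can be sketched as a nearest-neighbor codebook lookup. This is an illustrative assumption about what such an operation might look like; the abstract does not specify the quantization rule, and `quantize` and the toy codebook are hypothetical.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`.

    Closeness is squared Euclidean distance, an assumption of this sketch.
    """
    def squared_distance(entry):
        return sum((v - c) ** 2 for v, c in zip(vector, entry))

    best_index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return best_index, codebook[best_index]


if __name__ == "__main__":
    # Toy memory array with three 2-D entries.
    memory = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(quantize((0.9, 0.1), memory))
```

Downstream components then operate on the integer index rather than on the continuous vector, which is where the reduction in redundancy would come from.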
Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators -- which include existing OPE methods as special cases -- that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
https://arxiv.org/abs/2410.02172
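For context on the high-variance estimators the abstract above refers to, here is a minimal sketch of ordinary importance sampling, a classical off-policy evaluation baseline. It is not the STAR framework; the function and argument names are hypothetical.

```python
def importance_sampling_estimate(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Estimate the target policy's expected return from behavior-policy data.

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob / behavior_prob: callables giving pi(action | state).
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0      # product of per-step probability ratios
        ret = 0.0         # discounted return of this trajectory
        discount = 1.0
        for state, action, reward in trajectory:
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)


if __name__ == "__main__":
    # One-step problem: behavior picks either action uniformly,
    # target always picks action 1, which pays reward 1.
    data = [[(0, 0, 0.0)], [(0, 1, 1.0)]]
    behavior = lambda s, a: 0.5
    target = lambda s, a: 1.0 if a == 1 else 0.0
    print(importance_sampling_estimate(data, target, behavior))
```

Because the weight is a product of ratios over the whole trajectory, its variance can grow quickly with the horizon, which is the kind of problem the abstract says motivates alternative estimators.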
Monocular Depth and Surface Normals Estimation (MDSNE) is crucial for tasks such as 3D reconstruction, autonomous navigation, and underwater exploration. Current methods rely either on discriminative models, which struggle with transparent or reflective surfaces, or generative models, which, while accurate, are computationally expensive. This paper presents a novel deep learning model for MDSNE, specifically tailored for underwater environments, using a hybrid architecture that integrates Convolutional Neural Networks (CNNs) with Transformers, leveraging the strengths of both approaches. Training effective MDSNE models is often hampered by noisy real-world datasets and the limited generalization of synthetic datasets. To address this, we generate pseudo-labeled real data using multiple pre-trained MDSNE models. To ensure the quality of this data, we propose the Depth Normal Evaluation and Selection Algorithm (DNESA), which evaluates and selects the most reliable pseudo-labeled samples using domain-specific metrics. A lightweight student model is then trained on this curated dataset. Our model reduces parameters by 90% and training costs by 80%, allowing real-time 3D perception on resource-constrained devices. Key contributions include: a novel and efficient MDSNE model, the DNESA algorithm, a domain-specific data pipeline, and a focus on real-time performance and scalability. Designed for real-world underwater applications, our model facilitates low-cost deployments in underwater robots and autonomous vehicles, bridging the gap between research and practical implementation.
https://arxiv.org/abs/2410.02072
Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent's performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS's performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs' reasoning and planning capabilities for agentic applications via test-time search and self-learning.
https://arxiv.org/abs/2410.02052
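The "traditional MCTS" that R-MCTS extends typically selects which child to explore with an upper-confidence rule. The sketch below shows the standard UCT score only; the contrastive reflection and multi-agent debate components from the abstract are not modeled, and the dictionary layout of a node is a hypothetical choice for illustration.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.41):
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits, c=1.41):
    """Pick the child dict (keys: 'value', 'visits') with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c),
    )


if __name__ == "__main__":
    kids = [
        {"name": "click_link", "value": 5.0, "visits": 10},
        {"name": "type_query", "value": 0.0, "visits": 0},
    ]
    print(select_child(kids, parent_visits=10)["name"])
```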
We reframe scene flow as the problem of estimating a continuous space and time PDE that describes motion for an entire observation sequence, represented with a neural prior. Our resulting unsupervised method, EulerFlow, produces high quality scene flow on real-world data across multiple domains, including large-scale autonomous driving scenes and dynamic tabletop settings. Notably, EulerFlow produces high quality flow on small, fast moving objects like birds and tennis balls, and exhibits emergent 3D point tracking behavior by solving its estimated PDE over long time horizons. On the Argoverse 2 2024 Scene Flow Challenge, EulerFlow outperforms all prior art, beating the next best unsupervised method by over 2.5x and the next best supervised method by over 10%.
https://arxiv.org/abs/2410.02031
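The point-tracking behavior described above, obtained by solving the estimated PDE over time, can be pictured as integrating a velocity field along a point's path. This is a minimal sketch under stated assumptions: `velocity_field` is a hypothetical stand-in for the learned neural prior, and forward Euler steps are chosen here for simplicity, since the abstract does not state the integrator.

```python
def track_point(position, velocity_field, t0, t1, steps=100):
    """Advect a 3-D point from time t0 to t1 through a velocity field.

    velocity_field(x, t) -> velocity vector at position x and time t.
    Uses forward Euler integration (an assumption of this sketch).
    """
    dt = (t1 - t0) / steps
    x = list(position)
    t = t0
    for _ in range(steps):
        v = velocity_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


if __name__ == "__main__":
    # Toy field: everything moves along +x at 1 unit per second.
    constant_flow = lambda x, t: (1.0, 0.0, 0.0)
    print(track_point((0.0, 0.0, 0.0), constant_flow, t0=0.0, t1=2.0))
```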
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities' high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models that instruct autonomous construction machines to perform necessary tasks in real-world construction zones. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The code and dataset are available at: this https URL
https://arxiv.org/abs/2410.01962
Endovascular interventions are a life-saving treatment for many diseases, yet suffer from drawbacks such as radiation exposure and potential scarcity of proficient physicians. Robotic assistance during these interventions could be a promising support towards these problems. Research focusing on autonomous endovascular interventions utilizing artificial intelligence-based methodologies is gaining popularity. However, variability in assessment environments hinders the ability to compare and contrast the efficacy of different approaches, primarily due to each study employing a unique evaluation framework. In this study, we present deep reinforcement learning-based autonomous endovascular device navigation on three distinct digital benchmark interventions: BasicWireNav, ArchVariety, and DualDeviceNav. The benchmark interventions were implemented with our modular simulation framework stEVE (simulated EndoVascular Environment). Autonomous controllers were trained solely in simulation and evaluated in simulation and on physical test benches with camera and fluoroscopy feedback. Autonomous control for BasicWireNav and ArchVariety reached high success rates and was successfully transferred from the simulated training environment to the physical test benches, while autonomous control for DualDeviceNav reached a moderate success rate. The experiments demonstrate the feasibility of stEVE and its potential for transferring controllers trained in simulation to real-world scenarios. Nevertheless, they also reveal areas that offer opportunities for future research. This study demonstrates the transferability of autonomous controllers from simulation to the real world in endovascular navigation and lowers the entry barriers and increases the comparability of research on endovascular assistance systems by providing open-source training scripts, benchmarks and the stEVE framework.
https://arxiv.org/abs/2410.01956
Agentic AIs (AIs that are capable and permitted to undertake complex actions with little supervision) mark a new frontier in AI capabilities and raise new questions about how to safely create and align such systems with users, developers, and society. Because agents' actions are influenced by their attitudes toward risk, one key aspect of alignment concerns the risk profiles of agentic AIs. Risk alignment will matter for user satisfaction and trust, but it will also have important ramifications for society more broadly, especially as agentic AIs become more autonomous and are allowed to control key aspects of our lives. AIs with reckless attitudes toward risk (either because they are calibrated to reckless human users or are poorly designed) may pose significant threats. They might also open 'responsibility gaps' in which there is no agent who can be held accountable for harmful actions. What risk attitudes should guide an agentic AI's decision-making? How might we design AI systems that are calibrated to the risk attitudes of their users? What guardrails, if any, should be placed on the range of permissible risk attitudes? What are the ethical considerations involved when designing systems that make risky decisions on behalf of others? We present three papers that bear on key normative and technical aspects of these questions.
https://arxiv.org/abs/2410.01927
Autonomous robots navigating in off-road terrain like forests open new opportunities for automation. While off-road navigation has been studied, existing work often relies on clearly delineated pathways. We present a method allowing for long-range planning, exploration and low-level control in unknown off-trail forest terrain, using vision and GPS only. We represent outdoor terrain with a topological map, which is a set of panoramic snapshots connected with edges containing traversability information. A novel traversability analysis method is demonstrated, predicting the existence of a safe path towards a target in an image. Navigating between nodes is done using goal-conditioned behavior cloning, leveraging the power of a pretrained vision transformer. An exploration planner is presented, efficiently covering an unknown off-road area with unknown traversability using a frontiers-based approach. The approach is successfully deployed to autonomously explore two 400 meters squared forest sites unseen during training, in difficult conditions for navigation.
https://arxiv.org/abs/2410.01925
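The frontier-based exploration over a topological map described above can be illustrated with a breadth-first search for the nearest unvisited node. This is a toy sketch, not the paper's planner: the adjacency-dictionary map, the `nearest_frontier` name, and hop count as the distance measure are all assumptions, and traversability information on the edges is ignored.

```python
from collections import deque


def nearest_frontier(graph, start, visited):
    """Return the closest node (in hops) that has not been visited yet.

    graph: dict mapping node -> list of neighboring nodes (known map so far).
    visited: set of nodes the robot has already explored.
    Returns None when every reachable node has been visited.
    """
    queue = deque([start])
    seen = {start}
    while queue:
        node = queue.popleft()
        if node not in visited:
            return node  # BFS order guarantees this is the nearest frontier
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    snapshot_map = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    print(nearest_frontier(snapshot_map, "a", visited={"a", "b"}))
```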
Coding assistants are increasingly leveraged in game design, both generating code and making high-level plans. To what degree can these tools align with developer workflows, and what new modes of human-computer interaction can emerge from their use? We present DreamGarden, an AI system capable of assisting with the development of diverse game environments in Unreal Engine. At the core of our method is an LLM-driven planner, capable of breaking down a single, high-level prompt -- a dream, memory, or imagined scenario provided by a human user -- into a hierarchical action plan, which is then distributed across specialized submodules facilitating concrete implementation. This system is presented to the user as a garden of plans and actions, both growing independently and responding to user intervention via seed prompts, pruning, and feedback. Through a user study, we explore design implications of this system, charting courses for future work in semi-autonomous assistants and open-ended simulation design.
https://arxiv.org/abs/2410.01791
3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings. Code, trained models, and dataset splits are available publicly.
https://arxiv.org/abs/2410.01678
In autonomous driving, accurate motion prediction is essential for safe and efficient motion planning. To ensure safety, planners must rely on reliable uncertainty information about the predicted future behavior of surrounding agents, yet this aspect has received limited attention. This paper addresses the so-far neglected problem of uncertainty modeling in trajectory prediction. We adopt a holistic approach that focuses on uncertainty quantification, decomposition, and the influence of model composition. Our method is based on a theoretically grounded information-theoretic approach to measure uncertainty, allowing us to decompose total uncertainty into its aleatoric and epistemic components. We conduct extensive experiments on the nuScenes dataset to assess how different model architectures and configurations affect uncertainty quantification and model robustness.
https://arxiv.org/abs/2410.01628
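One common information-theoretic way to split total predictive uncertainty into aleatoric and epistemic parts uses the entropy of an ensemble's averaged prediction. The sketch below assumes an ensemble of models that each output a probability distribution over a discrete set of trajectory modes; whether the paper uses exactly this decomposition is not stated in the abstract, and all names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) for an ensemble of distributions.

    total     = entropy of the averaged distribution
    aleatoric = average entropy of the individual members
    epistemic = total - aleatoric (the mutual information term)
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(member[j] for member in member_probs) / n for j in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(member) for member in member_probs) / n
    return total, aleatoric, total - aleatoric


if __name__ == "__main__":
    # Members agree the outcome is uncertain: uncertainty is aleatoric.
    print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
    # Members are each confident but disagree: uncertainty is epistemic.
    print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

Both toy cases have the same total uncertainty; what differs is whether it comes from noise the members agree on or from disagreement between members.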
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions into long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with better scaling for inference computation. Code is available at this https URL.
https://arxiv.org/abs/2410.01440
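The idea of refining a draft plan until an equilibrium is reached can be pictured as a fixed-point iteration: apply the refinement step repeatedly and stop once the plan no longer changes. The sketch below is a toy illustration only; `refine` stands in for a learned planner, and the stopping rule (exact equality) is an assumption.

```python
def refine_to_equilibrium(draft, refine, max_iters=50):
    """Apply `refine` until the plan stops changing or max_iters is hit.

    Returns (plan, number_of_refinements_that_changed_the_plan).
    """
    plan = draft
    for iteration in range(max_iters):
        new_plan = refine(plan)
        if new_plan == plan:
            return plan, iteration  # fixed point: refine(plan) == plan
        plan = new_plan
    return plan, max_iters


if __name__ == "__main__":
    # Hypothetical refinement step: drop consecutive duplicate actions.
    def drop_repeats(plan):
        return [a for i, a in enumerate(plan) if i == 0 or a != plan[i - 1]]

    draft = ["walk", "walk", "grab", "grab", "grab", "open"]
    print(refine_to_equilibrium(draft, drop_repeats))
```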