Existing definitions and associated conceptual frameworks for computer-based system safety should be revisited in light of real-world experiences from deploying autonomous vehicles. Current terminology used by industry safety standards emphasizes mitigation of risk from specifically identified hazards, and carries assumptions based on human-supervised vehicle operation. Operation without a human driver dramatically increases the scope of safety concerns, especially due to operation in an open-world environment, a requirement to self-enforce operational limits, participation in an ad hoc sociotechnical system of systems, and a requirement to conform to both legal and ethical constraints. Existing standards and terminology only partially address these new challenges. We propose updated definitions for core system safety concepts that encompass these additional considerations as a starting point for evolving safety approaches to these new challenges. These results might additionally inform the framing of safety terminology for other autonomous system applications.
https://arxiv.org/abs/2404.16768
Autonomous navigation in dynamic environments is a complex but essential task for autonomous robots, with recent deep reinforcement learning approaches showing promising results. However, the complexity of the real world makes it infeasible to train agents in every possible scenario configuration. Moreover, existing methods typically overlook factors such as robot kinodynamic constraints, or assume perfect knowledge of the environment. In this work, we present RUMOR, a novel planner for differential-drive robots that uses deep reinforcement learning to navigate in highly dynamic environments. Unlike other end-to-end DRL planners, it uses a descriptive robocentric velocity space model to extract the dynamic environment information, enhancing training effectiveness and scenario interpretation. Additionally, we propose an action space that inherently considers robot kinodynamics and train the planner in a simulator that reproduces problematic aspects of the real world, reducing the gap between reality and simulation. We extensively compare RUMOR with other state-of-the-art approaches, demonstrating better performance, and provide a detailed analysis of the results. Finally, we validate RUMOR's performance in real-world settings by deploying it on a ground robot. Our experiments, conducted in crowded scenarios and unseen environments, confirm the algorithm's robustness and transferability.
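To make the kinodynamics-aware action space concrete, here is a minimal Python sketch of the underlying idea: commanded velocities are clipped to what a differential-drive base can actually reach within one control period, given acceleration limits. All names and limit values are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

# Illustrative differential-drive limits (assumed, not from the paper).
V_MAX, W_MAX = 1.0, 1.5   # max linear (m/s) and angular (rad/s) speed
A_LIN, A_ANG = 0.5, 1.0   # linear (m/s^2) and angular (rad/s^2) acceleration
DT = 0.1                  # control period (s)

def feasible_command(v_cmd, w_cmd, v_cur, w_cur):
    """Clip a commanded (v, w) to velocities reachable within one step."""
    v = np.clip(v_cmd, max(v_cur - A_LIN * DT, 0.0), min(v_cur + A_LIN * DT, V_MAX))
    w = np.clip(w_cmd, max(w_cur - A_ANG * DT, -W_MAX), min(w_cur + A_ANG * DT, W_MAX))
    return v, w
```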
https://arxiv.org/abs/2404.16672
Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA: a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness. Our source code is publicly available at this https URL.
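The per-episode randomization could be organized roughly as in the sketch below; the configuration fields and value pools are hypothetical illustrations of how layout and language variation might be sampled, not B-MoCA's actual settings.

```python
import random
from dataclasses import dataclass

@dataclass
class DeviceConfig:
    language: str      # system language setting
    icon_size: str     # affects UI layout
    wallpaper: int     # visual variation
    dark_mode: bool

def sample_device_config(rng: random.Random) -> DeviceConfig:
    """Draw one randomized device setup for a benchmark episode."""
    return DeviceConfig(
        language=rng.choice(["en", "ko", "es", "de"]),
        icon_size=rng.choice(["small", "medium", "large"]),
        wallpaper=rng.randrange(10),
        dark_mode=rng.random() < 0.5,
    )

config = sample_device_config(random.Random(0))  # seeded for reproducibility
```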
https://arxiv.org/abs/2404.16660
In this paper, we propose a novel approach to address the problem of camera and radar sensor fusion for 3D object detection in autonomous vehicle perception systems. Our approach builds on recent advances in deep learning and leverages the strengths of both sensors to improve object detection performance. Specifically, we extract 2D features from camera images using a state-of-the-art deep learning architecture and then apply a novel Cross-Domain Spatial Matching (CDSM) transformation method to convert these features into 3D space. We then fuse them with extracted radar data using a complementary fusion strategy to produce a final 3D object representation. To demonstrate the effectiveness of our approach, we evaluate it on the NuScenes dataset. We compare our approach to both single-sensor performance and current state-of-the-art fusion methods. Our results show that the proposed approach achieves superior performance over single-sensor solutions and could directly compete with other top-level fusion methods.
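The abstract does not spell out the CDSM transformation, but the general shape of lifting camera features into the radar's 3D space can be sketched as below, assuming known intrinsics and per-feature depth; the function names and the concatenation-style fusion are placeholders, not the paper's exact method.

```python
import numpy as np

def backproject_to_3d(uv, depth, K):
    """Lift 2D pixel locations with depth into 3D camera coordinates.

    uv: (N, 2) pixel coordinates, depth: (N,) metres, K: 3x3 intrinsics."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T  # (3, N) view rays
    return (rays * depth).T                            # (N, 3) 3D points

def complementary_fuse(cam_feats, radar_feats):
    """Placeholder fusion: concatenate per-point camera and radar features
    once both live in the same 3D space."""
    return np.concatenate([cam_feats, radar_feats], axis=-1)
```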
https://arxiv.org/abs/2404.16548
Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods, which rely on attack data or prior knowledge, struggle to effectively address a wide range of adversarial patches. In this paper, we identify two inherent characteristics of adversarial patches, semantic independence and spatial heterogeneity, that hold regardless of their appearance, shape, size, quantity, and location. Semantic independence indicates that adversarial patches operate autonomously within their semantic context, while spatial heterogeneity manifests as a distinct image quality in the patch area that differs from the original clean image, owing to the patch's independent generation process. Based on these observations, we propose PAD, a novel adversarial patch localization and removal method that does not require prior knowledge or additional training. PAD offers patch-agnostic defense against various adversarial patches and is compatible with any pre-trained object detector. Our comprehensive digital and physical experiments involving diverse patch types, such as localized noise, printable, and naturalistic patches, exhibit notable improvements over state-of-the-art works. Our code is available at this https URL.
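As a rough illustration of how spatial heterogeneity might be exploited for localization, the sketch below scores local blocks by an image-quality proxy (here simply variance) and blanks out outliers before detection. This heuristic is a stand-in of my own, not PAD's actual estimator.

```python
import numpy as np

def heterogeneity_map(img, k=16):
    """Score each k x k block; local variance is a crude proxy for the
    patch-vs-background image-quality gap described in the paper."""
    h, w = img.shape[:2]
    score = np.zeros((h // k, w // k))
    for i in range(h // k):
        for j in range(w // k):
            score[i, j] = img[i * k:(i + 1) * k, j * k:(j + 1) * k].var()
    return score

def mask_patches(img, k=16, z=2.0):
    """Blank out blocks whose score deviates strongly from the mean."""
    s = heterogeneity_map(img, k)
    outliers = np.abs(s - s.mean()) > z * s.std()
    out = img.copy()
    for i, j in zip(*np.nonzero(outliers)):
        out[i * k:(i + 1) * k, j * k:(j + 1) * k] = 0  # or inpaint
    return out
```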
https://arxiv.org/abs/2404.16452
Generative AI (GAI) can enhance the cognitive, reasoning, and planning capabilities of intelligent modules in the Internet of Vehicles (IoV) by synthesizing augmented datasets, completing sensor data, and making sequential decisions. In addition, the mixture of experts (MoE) can enable the distributed and collaborative execution of AI models without performance degradation between connected vehicles. In this survey, we explore the integration of MoE and GAI to enable Artificial General Intelligence in IoV, which can enable the realization of full autonomy for IoV with minimal human supervision and applicability in a wide range of mobility scenarios, including environment monitoring, traffic management, and autonomous driving. In particular, we present the fundamentals of GAI, MoE, and their interplay applications in IoV. Furthermore, we discuss the potential integration of MoE and GAI in IoV, including distributed perception and monitoring, collaborative decision-making and planning, and generative modeling and simulation. Finally, we present several potential research directions for facilitating the integration.
https://arxiv.org/abs/2404.16356
As one of the emerging challenges in Automated Machine Learning, the Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. The HW-NAS for real-time semantic segmentation inherently needs to balance multiple optimization objectives, including model accuracy, inference speed, and hardware-specific considerations. Despite its importance, benchmarks have yet to be developed to frame such a challenging task as multi-objective optimization. To bridge the gap, we introduce a tailored streamline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. Building upon the streamline, we present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset. The CitySeg/MOP test suite is integrated into the EvoXBench platform to provide seamless interfaces with various programming languages (e.g., Python and MATLAB) for instant fitness evaluations. We comprehensively assessed the CitySeg/MOP test suite on various multi-objective evolutionary algorithms, showcasing its versatility and practicality. Source codes are available at this https URL.
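Framing HW-NAS as a standard MOP boils down to evaluating each architecture as a vector of objectives and keeping the non-dominated set. The sketch below shows that generic shape with an assumed `evaluate` function; it deliberately does not reproduce the real EvoXBench interface.

```python
def dominates(f1, f2):
    """True if f1 Pareto-dominates f2 (all objectives minimized)."""
    return all(a <= b for a, b in zip(f1, f2)) and any(a < b for a, b in zip(f1, f2))

def pareto_front(population, evaluate):
    """Keep architectures whose objective vector is non-dominated.

    evaluate(arch) is assumed to return e.g. (1 - mIoU, latency_ms)."""
    fits = {arch: evaluate(arch) for arch in population}
    return [a for a in population
            if not any(dominates(fits[b], fits[a])
                       for b in population if b is not a)]
```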
https://arxiv.org/abs/2404.16266
This survey analyzes intermediate fusion methods in collaborative perception for autonomous driving, categorized by real-world challenges. We examine various methods, detailing their features and the evaluation metrics they employ. The focus is on addressing challenges like transmission efficiency, localization errors, communication disruptions, and heterogeneity. Moreover, we explore strategies to counter adversarial attacks and defenses, as well as approaches to adapt to domain shifts. The objective is to present an overview of how intermediate fusion methods effectively meet these diverse challenges, highlighting their role in advancing the field of collaborative perception in autonomous driving.
https://arxiv.org/abs/2404.16139
Physics-based simulations have accelerated progress in robot learning for driving, manipulation, and locomotion. Yet, a fast, accurate, and robust surgical simulation environment remains a challenge. In this paper, we present ORBIT-Surgical, a physics-based surgical robot simulation framework with photorealistic rendering in NVIDIA Omniverse. We provide 14 benchmark surgical tasks for the da Vinci Research Kit (dVRK) and Smart Tissue Autonomous Robot (STAR) which represent common subtasks in surgical training. ORBIT-Surgical leverages GPU parallelization to train reinforcement learning and imitation learning algorithms to facilitate study of robot learning to augment human surgical skills. ORBIT-Surgical also facilitates realistic synthetic data generation for active perception tasks. We demonstrate ORBIT-Surgical sim-to-real transfer of learned policies onto a physical dVRK robot. Project website: this http URL
https://arxiv.org/abs/2404.16027
The livestock industry faces several challenges, including labor-intensive management, the threat of predators and environmental sustainability concerns. Therefore, this paper explores the integration of quadruped robots in extensive livestock farming as a novel application of field robotics. The SELF-AIR project, an acronym for Supporting Extensive Livestock Farming with the use of Autonomous Intelligent Robots, exemplifies this innovative approach. Through advanced sensors, artificial intelligence, and autonomous navigation systems, these robots exhibit remarkable capabilities in navigating diverse terrains, monitoring large herds, and aiding in various farming tasks. This work provides insight into the SELF-AIR project, presenting the lessons learned.
https://arxiv.org/abs/2404.16008
Autonomous vehicles (AVs) heavily rely on LiDAR perception for environment understanding and navigation. LiDAR intensity provides valuable information about the reflected laser signals and plays a crucial role in enhancing the perception capabilities of AVs. However, accurately simulating LiDAR intensity remains a challenge due to the unavailability of material properties of the objects in the environment, and complex interactions between the laser beam and the environment. The proposed method aims to improve the accuracy of intensity simulation by incorporating physics-based modalities within the deep learning framework. One of the key entities that captures the interaction between the laser beam and the objects is the angle of incidence. In this work, we demonstrate that adding the LiDAR incidence angle as a separate input to the deep neural networks significantly enhances the results. We present a comparative study between two prominent deep learning architectures: U-NET, a Convolutional Neural Network (CNN), and Pix2Pix, a Generative Adversarial Network (GAN). We implemented these two architectures for the intensity prediction task and used the SemanticKITTI and VoxelScape datasets for experiments. The comparative analysis reveals that both architectures benefit from the incidence angle as an additional input. Moreover, the Pix2Pix architecture outperforms U-NET, especially when the incidence angle is incorporated.
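The paper's key idea of feeding the incidence angle as a separate input reduces, in its simplest reading, to an extra channel concatenated to the network input. A minimal PyTorch sketch with toy shapes and an assumed range-image layout:

```python
import torch
import torch.nn as nn

class IntensityNet(nn.Module):
    """Toy fully convolutional intensity predictor over range images."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),  # intensity in [0, 1]
        )
    def forward(self, x):
        return self.net(x)

geometry = torch.rand(2, 3, 64, 1024)   # e.g. x, y, z channels per beam
incidence = torch.rand(2, 1, 64, 1024)  # angle of incidence per beam
model = IntensityNet(in_ch=4)
pred = model(torch.cat([geometry, incidence], dim=1))  # angle as extra channel
```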
https://arxiv.org/abs/2404.15774
Mapping traversal costs in an environment and planning paths based on this map are important for autonomous navigation. We present a neurobotic navigation system that utilizes a Spiking Neural Network Wavefront Planner and E-prop learning to concurrently map and plan paths in a large and complex environment. We incorporate a novel method for mapping which, when combined with the Spiking Wavefront Planner, allows for adaptive planning by selectively considering any combination of costs. The system is tested on a mobile robot platform in an outdoor environment with obstacles and varying terrain. Results indicate that the system is capable of discerning features in the environment using three measures of cost: (1) energy expenditure by the wheels, (2) time spent in the presence of obstacles, and (3) terrain slope. In just twelve hours of online training, E-prop learns and incorporates traversal costs into the path planning maps by updating the delays in the Spiking Wavefront Planner. On simulated paths, the Spiking Wavefront Planner plans significantly shorter and lower-cost paths than A* and RRT*. The Spiking Wavefront Planner is compatible with neuromorphic hardware and could be used for applications requiring low size, weight, and power.
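Conceptually, the wavefront planner propagates a wave from the goal, with each cell's learned traversal cost acting as a propagation delay; in conventional terms this is Dijkstra-like arrival-time propagation followed by descent on arrival times. A non-spiking sketch of that idea, not the neuromorphic implementation:

```python
import heapq

def wavefront_plan(cost, start, goal):
    """cost: dict (x, y) -> traversal delay on a 4-connected grid."""
    dist, frontier = {goal: 0.0}, [(0.0, goal)]
    while frontier:                      # propagate the wave from the goal
        d, (x, y) = heapq.heappop(frontier)
        if (x, y) == start:
            break
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in cost and d + cost[nb] < dist.get(nb, float("inf")):
                dist[nb] = d + cost[nb]
                heapq.heappush(frontier, (dist[nb], nb))
    path, cur = [start], start           # descend arrival times to the goal
    while cur != goal:
        cur = min(((cur[0] + dx, cur[1] + dy)
                   for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))),
                  key=lambda n: dist.get(n, float("inf")))
        path.append(cur)
    return path
```

In this reading, E-prop learning would correspond to updating the entries of `cost` from experienced energy, obstacle, and slope measurements.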
https://arxiv.org/abs/2404.15524
Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation perception problem. These discriminative methods focus on learning the mapping between the inputs and the occupancy map in a single step, lacking the ability to gradually refine the occupancy map and the scene-imagination capacity to plausibly complete local regions. In this paper, we introduce OccGen, a simple yet powerful generative perception model for the task of 3D semantic occupancy prediction. OccGen adopts a "noise-to-occupancy" generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods. For instance, OccGen achieves relative mIoU improvements of 9.5%, 6.3%, and 13.3% on the nuScenes-Occupancy dataset under the multi-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions.
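The "noise-to-occupancy" paradigm can be caricatured as the loop below: a voxel grid initialized from a Gaussian is repeatedly denoised, conditioned on fused multi-modal features. The update rule and step count are schematic simplifications, not OccGen's actual diffusion schedule, and `denoiser` stands in for the trained conditional network.

```python
import torch

@torch.no_grad()
def noise_to_occupancy(denoiser, cond_feats, shape, steps=10):
    """Progressively refine a random voxel grid into an occupancy map."""
    x = torch.randn(shape)                   # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cond_feats)     # predicted noise at step t
        x = x - eps / steps                  # simplified denoising update
    return x.sigmoid()                       # per-voxel occupancy probability

dummy = lambda x, t, c: torch.zeros_like(x)  # stands in for the trained net
occ = noise_to_occupancy(dummy, cond_feats=None, shape=(1, 16, 16, 16))
```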
https://arxiv.org/abs/2404.15014
We propose a novel pipeline for unknown object grasping in shared robotic autonomy scenarios. State-of-the-art methods for fully autonomous scenarios are typically learning-based approaches, optimised for a specific end-effector, that generate grasp poses directly from sensor input. In the domain of assistive robotics, we seek instead to utilise the user's cognitive abilities for enhanced satisfaction, grasping performance, and alignment with their high-level task-specific goals. Given a pair of stereo images, we perform unknown object instance segmentation and generate a 3D reconstruction of the object of interest. In shared control, the user then guides the robot end-effector across a virtual hemisphere centered around the object to their desired approach direction. A physics-based grasp planner finds the most stable local grasp on the reconstruction, and finally the user is guided by shared control to this grasp. In experiments on the DLR EDAN platform, we report a grasp success rate of 87% for 10 unknown objects, and demonstrate the method's capability to grasp objects in structured clutter and from shelves.
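The "virtual hemisphere" interaction reduces to picking an azimuth and elevation on an upper hemisphere around the object centroid; the approach direction then points from the sampled pose toward the object. A geometry-only sketch with assumed argument names:

```python
import numpy as np

def hemisphere_pose(center, radius, azimuth, elevation):
    """End-effector position on an upper hemisphere around the object.

    azimuth in [0, 2*pi), elevation in [0, pi/2]."""
    offset = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    position = np.asarray(center) + offset
    approach = -offset / np.linalg.norm(offset)  # points at the object
    return position, approach
```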
https://arxiv.org/abs/2404.15001
Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
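Late fusion of per-view vision-language embeddings can be as simple as averaging and classifying, which is roughly what the sketch below does; the embedding dimension, view count, and mean-pooling choice are assumptions for illustration, and the frozen encoder (e.g. a CLIP-style image tower) is not shown.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Average per-view embeddings, then classify driver activity."""
    def __init__(self, embed_dim, n_classes):
        super().__init__()
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, view_embeddings):       # (views, batch, embed_dim)
        fused = view_embeddings.mean(dim=0)   # late fusion across cameras
        return self.head(fused).softmax(-1)   # class probabilities

embeddings = torch.randn(3, 8, 512)           # 3 camera views, batch of 8
probs = LateFusionHead(512, 16)(embeddings)
```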
https://arxiv.org/abs/2404.14906
Dynamic obstacle avoidance is a popular research topic for autonomous systems, such as micro aerial vehicles and service robots. Accurately evaluating the performance of dynamic obstacle avoidance methods necessitates the establishment of a metric to quantify the environment's difficulty, a crucial aspect that remains unexplored. In this paper, we propose four metrics to measure the difficulty of dynamic environments. These metrics aim to comprehensively capture the influence of obstacles' number, size, velocity, and other factors on the difficulty. We compare the proposed metrics with existing static environment difficulty metrics and validate them through over 1.5 million trials in a customized simulator. This simulator excludes the effects of perception and control errors and supports different motion and gaze planners for obstacle avoidance. The results indicate that the survivability metric outperforms the others, establishing a monotonic relationship with the success rate at a Spearman's Rank Correlation Coefficient (SRCC) above 0.9. Specifically, for every planner, lower survivability leads to a higher success rate. This metric not only facilitates fair and comprehensive benchmarking but also provides insights for refining collision avoidance methods, thereby furthering the evolution of autonomous systems in dynamic environments.
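Verifying a monotonic relationship with Spearman's rank correlation is a one-liner with SciPy; the numbers below are synthetic, purely to show the computation and the expected sign (lower survivability, higher success rate).

```python
from scipy.stats import spearmanr

# Synthetic illustration: per-environment survivability vs. success rate.
survivability = [0.9, 0.7, 0.5, 0.3, 0.1]
success_rate = [0.20, 0.40, 0.60, 0.80, 0.95]

rho, p_value = spearmanr(survivability, success_rate)
print(f"SRCC = {rho:.2f}")  # -1.0 here: perfectly monotonic, negative
```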
https://arxiv.org/abs/2404.14848
The fusion of multimodal sensor data streams such as camera images and lidar point clouds plays an important role in the operation of autonomous vehicles (AVs). Robust perception across a range of adverse weather and lighting conditions is specifically required for AVs to be deployed widely. While multi-sensor fusion networks have been previously developed for perception in sunny and clear weather conditions, these methods show a significant degradation in performance under night-time and poor weather conditions. In this paper, we propose a simple yet effective technique called ContextualFusion to incorporate the domain knowledge about cameras and lidars behaving differently across lighting and weather variations into 3D object detection models. Specifically, we design a Gated Convolutional Fusion (GatedConv) approach for the fusion of sensor streams based on the operational context. To aid in our evaluation, we use the open-source simulator CARLA to create a multimodal adverse-condition dataset called AdverseOp3D to address the shortcomings of existing datasets being biased towards daytime and good-weather conditions. Our ContextualFusion approach yields an mAP improvement of 6.2% over state-of-the-art methods on our context-balanced synthetic dataset. Finally, our method enhances state-of-the-art 3D object detection performance at night on the real-world NuScenes dataset with a significant mAP improvement of 11.7%.
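A plausible reading of gated convolutional fusion is a learned gate, computed from both feature maps plus a broadcast operational-context vector (e.g. lighting/weather flags), that blends camera and lidar features per location. The sketch below follows that reading; the shapes and gating formula are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedConvFusion(nn.Module):
    """Blend camera and lidar feature maps with a context-aware gate."""
    def __init__(self, ch, ctx_dim):
        super().__init__()
        self.gate = nn.Conv2d(2 * ch + ctx_dim, ch, kernel_size=1)

    def forward(self, cam, lidar, ctx):
        b, _, h, w = cam.shape
        ctx_map = ctx[:, :, None, None].expand(b, ctx.shape[1], h, w)
        g = torch.sigmoid(self.gate(torch.cat([cam, lidar, ctx_map], dim=1)))
        return g * cam + (1 - g) * lidar       # per-pixel weighted blend

fused = GatedConvFusion(64, 2)(torch.randn(1, 64, 50, 50),
                               torch.randn(1, 64, 50, 50),
                               torch.tensor([[1.0, 0.0]]))  # e.g. night, dry
```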
https://arxiv.org/abs/2404.14780
Large Language Models (LLMs) and multi-agent systems have shown impressive capabilities in natural language tasks but face challenges in clinical trial applications, primarily due to limited access to external knowledge. Recognizing the potential of advanced clinical trial tools that aggregate and predict based on the latest medical data, we propose an integrated solution to enhance their accessibility and utility. We introduce Clinical Agent System (CT-Agent), a Clinical multi-agent system designed for clinical trial tasks, leveraging GPT-4, multi-agent architectures, LEAST-TO-MOST, and ReAct reasoning technology. This integration not only boosts LLM performance in clinical contexts but also introduces novel functionalities. Our system autonomously manages the entire clinical trial process, demonstrating significant efficiency improvements in our evaluations, which include both computational benchmarks and expert feedback.
https://arxiv.org/abs/2404.14777
Testing and evaluating the safety performance of autonomous vehicles (AVs) is essential before large-scale deployment. Practically, the number of testing scenarios permissible for a specific AV is severely limited by tight constraints on testing budgets and time. Under such strict limits on the number of tests, existing testing methods often produce evaluation results that carry significant uncertainty or are difficult to quantify. In this paper, we formulate this problem for the first time as the "few-shot testing" (FST) problem and propose a systematic framework to address this challenge. To alleviate the considerable uncertainty inherent in a small testing scenario set, we frame the FST problem as an optimization problem and search for the testing scenario set based on neighborhood coverage and similarity. Specifically, guided by the better generalization ability of the testing scenario set on AVs, we dynamically adjust this set and the contribution of each testing scenario to the evaluation result based on coverage, leveraging the prior information of surrogate models (SMs). Under certain hypotheses on the SMs, a theoretical upper bound on the evaluation error is established to verify the sufficiency of evaluation accuracy within the given limited number of tests. Experiment results on cut-in scenarios demonstrate a notable reduction in the evaluation error and variance of our method compared to conventional testing methods, especially for situations with a strict limit on the number of scenarios.
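One crude way to realize "search for the testing scenario set based on neighborhood coverage and similarity" is greedy maximum coverage: each chosen scenario covers all scenarios within distance eps in parameter space. This sketch ignores the surrogate-model weighting and is only meant to convey the shape of the selection step.

```python
import numpy as np

def greedy_coverage_select(scenarios, budget, eps):
    """Pick `budget` scenarios whose eps-neighborhoods cover the most others.

    scenarios: (N, d) array of scenario parameters."""
    n = len(scenarios)
    d = np.linalg.norm(scenarios[:, None] - scenarios[None, :], axis=-1)
    covered, chosen = np.zeros(n, dtype=bool), []
    for _ in range(budget):
        gains = [(~covered & (d[i] <= eps)).sum() for i in range(n)]
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= d[best] <= eps
    return chosen

picks = greedy_coverage_select(np.random.rand(200, 3), budget=10, eps=0.2)
```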
https://arxiv.org/abs/2402.01795
Lane detection enables highly functional autonomous driving systems to understand driving scenes even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) We illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity of lanes on the LiDAR point cloud frames, and then obtain noisy lane labels in the 2D plane by projecting the 3D points; (ii) We propose a novel self-supervised training scheme, dubbed LaneCorrect, that automatically corrects the lane label by learning geometric consistency and instance awareness from the adversarial augmentations; (iii) With the self-supervised pre-trained model, we distill to train a student network for arbitrary target lane (e.g., TuSimple) detection without any human labels; (iv) We thoroughly evaluate our self-supervised method on four major lane detection benchmarks (including TuSimple, CULane, CurveLanes and LLAMAS) and demonstrate excellent performance compared with existing supervised counterparts, whilst showing more effective results in alleviating the domain gap, i.e., training on CULane and testing on TuSimple.
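Contribution (i) amounts to thresholding LiDAR returns on intensity (lane paint is retro-reflective) and projecting the surviving 3D points into the image to obtain noisy 2D labels. A sketch under assumed calibration conventions; the threshold value and all names are illustrative:

```python
import numpy as np

def noisy_lane_labels(points, intensity, K, T_cam_lidar, thresh=0.8):
    """points: (N, 3) lidar-frame xyz, intensity: (N,) in [0, 1],
    K: 3x3 camera intrinsics, T_cam_lidar: 4x4 lidar-to-camera extrinsics."""
    lanes = points[intensity > thresh]               # keep bright returns
    homo = np.hstack([lanes, np.ones((len(lanes), 1))])
    cam = (T_cam_lidar @ homo.T)[:3]                 # into camera frame
    cam = cam[:, cam[2] > 0]                         # points in front only
    proj = K @ cam
    return (proj[:2] / proj[2]).T                    # (M, 2) pixel labels
```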
https://arxiv.org/abs/2404.14671