Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, in the absence of widely acknowledged testing mechanisms, the question of whether LLMs are stochastic parrots or genuinely comprehend the world remains open, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU, neglecting fine-grained explorations. However, such explorations are crucial for understanding LLMs' unique comprehension mechanisms, aligning with human cognition, and ultimately enhancing their general NLU capacities. To address this gap, our study delves into LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the innovative construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Evaluating both open-source and closed-source models of varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are introduced to help alleviate this issue, yet limitations persist. By highlighting these critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.
https://arxiv.org/abs/2405.05741
Evaluating and updating the obstacle avoidance velocity for an autonomous robot in real-time ensures robustness against noise and disturbances. A passive damping controller can obtain the desired motion with a torque-controlled robot, which remains compliant and ensures a safe response to external perturbations. Here, we propose a novel approach for designing the passive control policy. Our algorithm complies with obstacle-free zones while transitioning to increased damping near obstacles to ensure collision avoidance. This approach ensures stability across diverse scenarios, effectively mitigating disturbances. Validation on a 7DoF robot arm demonstrates superior collision rejection capabilities compared to the baseline, underlining its practicality for real-world applications. Our obstacle-aware damping controller represents a substantial advancement in secure robot control within complex and uncertain environments.
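The core mechanism, damping that interpolates from a low free-space gain to a high near-obstacle gain, can be sketched as follows. The gains, the linear interpolation law, and the function names here are illustrative assumptions, not the paper's actual control policy:

```python
import numpy as np

def damping_gain(d, d_safe=0.5, k_free=5.0, k_obs=50.0):
    """Interpolate damping between a low free-space gain (k_free) and a
    high near-obstacle gain (k_obs), based on distance d to the closest
    obstacle. Gains and the linear law are hypothetical."""
    alpha = np.clip(d / d_safe, 0.0, 1.0)   # 1 far from obstacles, 0 at contact
    return k_obs + alpha * (k_free - k_obs)

def passive_damping_torque(qdot, v_des, d_obs):
    """tau = -D(d) * (qdot - v_des): track the avoidance velocity while
    staying compliant; damping stiffens only near obstacles."""
    D = damping_gain(d_obs) * np.eye(len(qdot))
    return -D @ (qdot - v_des)

# Robot at rest, desired avoidance velocity 0.1 rad/s per joint, obstacle 0.1 m away
tau = passive_damping_torque(np.zeros(7), np.ones(7) * 0.1, d_obs=0.1)
```

Because the torque is a pure damping term around the desired velocity, an external push is met with dissipation rather than position error rejection, which is what keeps the response safe.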
https://arxiv.org/abs/2405.05669
Autonomous systems often employ multiple LiDARs to leverage their integrated advantages, enhancing perception and robustness. The most critical prerequisite under this setting is estimating the extrinsic parameters between the LiDARs, i.e., calibration. Despite the exciting progress in multi-LiDAR calibration efforts, a universal, sensor-agnostic calibration method remains elusive. Following a coarse-to-fine framework, we first design a spherical descriptor, TERRA, for 3-DoF rotation initialization with no prior knowledge. To further optimize, we present JEEP for the joint estimation of extrinsics and pose, integrating geometric and motion information to overcome factors affecting the point cloud registration. Finally, the LiDAR poses optimized by the hierarchical optimization module are fed into a time synchronization module to produce the ultimate calibration results, including the time offset. To verify the effectiveness, we conduct extensive experiments on eight datasets, covering 16 diverse types of LiDARs and dozens of calibration tasks. Even in the challenging tasks, the calibration errors can still be controlled within 5 cm and 1° with a high success rate.
https://arxiv.org/abs/2405.05589
Deep learning-based lane detection (LD) plays a critical role in autonomous driving systems, such as adaptive cruise control. However, it is vulnerable to backdoor attacks. Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. To address the challenges posed by changing driving perspectives, we propose an amorphous trigger pattern composed of shapeless pixels. This trigger design allows the backdoor to be activated by various forms or shapes of mud spots or pollution on the road or lens, enabling adaptation to changes in vehicle observation viewpoints during driving. To mitigate the effects of environmental changes, we design a meta-learning framework to train meta-generators tailored to different environmental conditions. These generators produce meta-triggers that incorporate diverse environmental information, such as weather or lighting conditions, as the initialization of the trigger patterns for backdoor implantation, thus enabling adaptation to dynamic environments. Extensive experiments on various commonly used LD models in both digital and physical domains validate the effectiveness of our attacks, outperforming other baselines significantly (+25.15\% on average in Attack Success Rate). Our code will be made available upon paper publication.
https://arxiv.org/abs/2405.05553
Neural Radiance Fields (NeRF) have emerged as a powerful paradigm for 3D scene representation, offering high-fidelity renderings and reconstructions from a set of sparse and unstructured sensor data. In the context of autonomous robotics, where perception and understanding of the environment are pivotal, NeRF holds immense promise for improving performance. In this paper, we present a comprehensive survey and analysis of the state-of-the-art techniques for utilizing NeRF to enhance the capabilities of autonomous robots. We especially focus on the perception, localization and navigation, and decision-making modules of autonomous robots and delve into tasks crucial for autonomous operation, including 3D reconstruction, segmentation, pose estimation, simultaneous localization and mapping (SLAM), navigation and planning, and interaction. Our survey meticulously benchmarks existing NeRF-based methods, providing insights into their strengths and limitations. Moreover, we explore promising avenues for future research and development in this domain. Notably, we discuss the integration of advanced techniques such as 3D Gaussian splatting (3DGS), large language models (LLMs), and generative AI, envisioning enhanced reconstruction efficiency, scene understanding, and decision-making capabilities. This survey serves as a roadmap for researchers seeking to leverage NeRFs to empower autonomous robots, paving the way for innovative solutions that can navigate and interact seamlessly in complex environments.
https://arxiv.org/abs/2405.05526
The robotic autonomous luggage trolley collection system employs robots to gather and transport scattered luggage trolleys at airports. However, existing methods for detecting and locating these luggage trolleys often fail when they are not fully visible. To address this, we introduce the Hierarchical Progressive Perception System (HPPS), which enhances the detection and localization of luggage trolleys under partial occlusion. The HPPS processes the luggage trolley's position and orientation separately, which requires only RGB images for labeling and training, eliminating the need for 3D coordinates and alignment. The HPPS can accurately determine the position of the luggage trolley with just one well-detected keypoint and estimate the luggage trolley's orientation when it is partially occluded. Once the luggage trolley's initial pose is detected, HPPS updates this information continuously to refine its accuracy until the robot begins grasping. The experiments on detection and localization demonstrate that HPPS is more reliable under partial occlusion compared to existing methods. Its effectiveness and robustness have also been confirmed through practical tests in actual luggage trolley collection tasks. A website about this work is available at HPPS.
https://arxiv.org/abs/2405.05514
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
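As a rough illustration of the kind of cross-scan mixing involved, the following sketch swaps alternating laser-inclination bands between two LiDAR scans. It is a simplified rendering of the basic LaserMix operation only; the full LaserMix++ framework additionally mixes labels, couples camera features via distillation, and adds language-driven guidance:

```python
import numpy as np

def lasermix(points_a, points_b, num_areas=3):
    """Swap alternating inclination-angle bands between two LiDAR scans
    (points as (N, 3) xyz arrays). A simplified sketch of the LaserMix
    mixing operation, exploiting the spatial prior that scene layout is
    structured along the laser beams' inclination."""
    def inclination(pts):
        return np.arctan2(pts[:, 2], np.linalg.norm(pts[:, :2], axis=1))

    inc_a, inc_b = inclination(points_a), inclination(points_b)
    lo = min(inc_a.min(), inc_b.min())
    hi = max(inc_a.max(), inc_b.max())
    edges = np.linspace(lo, hi, num_areas + 1)
    band_a = np.clip(np.digitize(inc_a, edges) - 1, 0, num_areas - 1)
    band_b = np.clip(np.digitize(inc_b, edges) - 1, 0, num_areas - 1)
    # even bands taken from scan A, odd bands from scan B
    return np.vstack([points_a[band_a % 2 == 0], points_b[band_b % 2 == 1]])
```

In the semi-supervised setting, the same mixing applied to predictions on unlabeled scans supplies the consistency-regularization targets the abstract refers to.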
https://arxiv.org/abs/2405.05258
In this work, we present SuFIA, the first framework for natural language-guided augmented dexterity for robotic surgical assistants. SuFIA incorporates the strong reasoning capabilities of large language models (LLMs) with perception modules to implement high-level planning and low-level control of a robot for surgical sub-task execution. This enables a learning-free approach to surgical augmented dexterity without any in-context examples or motion primitives. SuFIA uses a human-in-the-loop paradigm by restoring control to the surgeon in the case of insufficient information, mitigating unexpected errors for mission-critical tasks. We evaluate SuFIA on four surgical sub-tasks in a simulation environment and two sub-tasks on a physical surgical robotic platform in the lab, demonstrating its ability to perform common surgical sub-tasks through supervised autonomous operation under challenging physical and workspace conditions. Project website: this http URL
https://arxiv.org/abs/2405.05226
3D occupancy perception technology aims to observe and understand dense 3D environments for autonomous vehicles. Owing to its comprehensive perception capability, this technology is emerging as a trend in autonomous driving perception systems, and is attracting significant attention from both industry and academia. Similar to traditional bird's-eye view (BEV) perception, 3D occupancy perception has the nature of multi-source input and the necessity for information fusion. However, the difference is that it captures vertical structures that are ignored by 2D BEV. In this survey, we review the most recent works on 3D occupancy perception, and provide in-depth analyses of methodologies with various input modalities. Specifically, we summarize general network pipelines, highlight information fusion techniques, and discuss effective network training. We evaluate and analyze the occupancy perception performance of the state-of-the-art on the most popular datasets. Furthermore, challenges and future research directions are discussed. We hope this report will inspire the community and encourage more research work on 3D occupancy perception. A comprehensive list of studies in this survey is available in an active repository that continuously collects the latest work: this https URL.
https://arxiv.org/abs/2405.05173
The 4D millimeter-wave (mmWave) radar, with its robustness in extreme environments, extensive detection range, and capabilities for measuring velocity and elevation, has demonstrated significant potential for enhancing the perception abilities of autonomous driving systems in corner-case scenarios. Nevertheless, the inherent sparsity and noise of 4D mmWave radar point clouds restrict its further development and practical application. In this paper, we introduce a novel 4D mmWave radar point cloud detector, which leverages high-resolution dense LiDAR point clouds. Our approach constructs dense 3D occupancy ground truth from stitched LiDAR point clouds, and employs a specially designed network named DenserRadar. The proposed method surpasses existing probability-based and learning-based radar point cloud detectors in terms of both point cloud density and accuracy on the K-Radar dataset.
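The ground-truth construction step, turning (stitched) LiDAR points into a dense occupancy grid, reduces at its core to voxelization. A toy numpy sketch under assumed bounds and voxel size; the paper's pipeline additionally stitches multi-frame scans and handles noise:

```python
import numpy as np

def occupancy_from_points(points, voxel=0.2, bounds=((-10, 10), (-10, 10), (-2, 4))):
    """Build a binary 3D occupancy grid from LiDAR points (N, 3).
    Voxel size and scene bounds are illustrative assumptions."""
    mins = np.array([b[0] for b in bounds], dtype=float)
    maxs = np.array([b[1] for b in bounds], dtype=float)
    shape = np.ceil((maxs - mins) / voxel).astype(int)
    grid = np.zeros(shape, dtype=bool)
    idx = np.floor((points - mins) / voxel).astype(int)
    valid = np.all((idx >= 0) & (idx < shape), axis=1)  # drop out-of-bounds points
    grid[tuple(idx[valid].T)] = True
    return grid
```

Such a grid, built from dense LiDAR, serves as the supervision signal against which the sparse, noisy radar detections are trained.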
https://arxiv.org/abs/2405.05131
It is widely acknowledged that we need to establish where responsibility lies for the outputs and impacts of AI-enabled systems. But without a clear and precise understanding of what "responsibility" means, deliberations about where responsibility lies will be, at best, unfocused and incomplete and, at worst, misguided. To address this concern, this paper draws upon central distinctions in philosophy and law to clarify the concept of responsibility for AI for policymakers, practitioners, researchers and students from non-philosophical and non-legal backgrounds. Taking the three-part formulation "Actor A is responsible for Occurrence O," the paper unravels the concept of responsibility to clarify that there are different possibilities of who is responsible for AI, the senses in which they are responsible, and aspects of events they are responsible for. Criteria and conditions for fitting attributions of responsibility in the core senses (causal responsibility, role-responsibility, liability responsibility and moral responsibility) are articulated to promote an understanding of when responsibility attributions would be inappropriate or unjust. The analysis is presented with a graphical notation to facilitate informal diagrammatic reasoning and discussion about specific cases. It is illustrated by application to a scenario of a fatal collision between an autonomous AI-enabled ship and a traditional, crewed vessel at sea.
https://arxiv.org/abs/2308.02608
Current autonomous driving systems heavily rely on V2X communication data to enhance situational awareness and the cooperation between vehicles. However, a major challenge when using V2X data is that it may not be available periodically because of unpredictable delays and data loss during wireless transmission between road stations and the receiver vehicle. This issue should be considered when designing control strategies for connected and autonomous vehicles. Therefore, this paper proposes a novel 'Blind Actor-Critic' algorithm that guarantees robust driving performance in V2X environments with delayed and/or lost data. The novel algorithm incorporates three key mechanisms: a virtual fixed sampling period, a combination of Temporal-Difference and Monte Carlo learning, and a numerical approximation of immediate reward values. To address the temporal aperiodicity problem of V2X data, we first illustrate this challenge. Then, we provide a detailed explanation of the Blind Actor-Critic algorithm where we highlight the proposed components to compensate for the temporal aperiodicity problem of V2X data. We evaluate the performance of our algorithm in a simulation environment and compare it to benchmark approaches. The results demonstrate that training metrics are improved compared to conventional actor-critic algorithms. Additionally, testing results show that our approach provides robust control, even under low V2X network reliability levels.
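The first mechanism, a virtual fixed sampling period, can be illustrated by resampling aperiodic V2X messages onto a fixed-rate grid with a zero-order hold. This is a hedged sketch of the idea only; the function name and the hold policy are assumptions, not the paper's exact design:

```python
import numpy as np

def virtual_fixed_sampling(timestamps, values, period, horizon):
    """Resample aperiodically received V2X messages onto a virtual
    fixed-period time grid by holding the last received value.
    Returns the grid and the held values (None before the first message)."""
    grid = np.arange(0.0, horizon, period)
    out, last, j = [], None, 0
    for t in grid:
        while j < len(timestamps) and timestamps[j] <= t:
            last = values[j]   # most recent message up to virtual tick t
            j += 1
        out.append(last)
    return grid, out
```

With the controller clocked on the virtual grid, delayed or dropped messages appear simply as a stale held value, which the learning components then have to compensate for.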
https://arxiv.org/abs/2405.05072
In the field of computer vision, self-supervised learning has emerged as a method to extract robust features from unlabeled data, where models derive labels autonomously from the data itself, without the need for manual annotation. This paper provides a comprehensive review of discriminative approaches of self-supervised learning within the domain of computer vision, examining their evolution and current status. Through an exploration of various methods including contrastive, self-distillation, knowledge distillation, feature decorrelation, and clustering techniques, we investigate how these approaches leverage the abundance of unlabeled data. Finally, we compare self-supervised learning methods on the standard ImageNet classification benchmark.
https://arxiv.org/abs/2405.04969
Predicting the future trajectories of dynamic traffic actors is a cornerstone task in autonomous driving. Though existing notable efforts have resulted in impressive performance improvements, a gap persists in scene cognitive and understanding of the complex traffic semantics. This paper proposes Traj-LLM, the first to investigate the potential of using Large Language Models (LLMs) without explicit prompt engineering to generate future motion from agents' past/observed trajectories and scene semantics. Traj-LLM starts with sparse context joint coding to dissect the agent and scene features into a form that LLMs understand. On this basis, we innovatively explore LLMs' powerful comprehension abilities to capture a spectrum of high-level scene knowledge and interactive information. Emulating the human-like lane focus cognitive function and enhancing Traj-LLM's scene comprehension, we introduce lane-aware probabilistic learning powered by the pioneering Mamba module. Finally, a multi-modal Laplace decoder is designed to achieve scene-compliant multi-modal predictions. Extensive experiments manifest that Traj-LLM, fortified by LLMs' strong prior knowledge and understanding prowess, together with lane-aware probability learning, outstrips state-of-the-art methods across evaluation metrics. Moreover, the few-shot analysis further substantiates Traj-LLM's performance, wherein with just 50% of the dataset, it outperforms the majority of benchmarks relying on complete data utilization. This study explores equipping the trajectory prediction task with advanced capabilities inherent in LLMs, furnishing a more universal and adaptable solution for forecasting agent motion in a new way.
https://arxiv.org/abs/2405.04909
For autonomous robotics applications, it is crucial that robots are able to accurately measure their potential state and perceive their environment, including other agents within it (e.g., cobots interacting with humans). The redundancy of these measurements is important, as it allows for planning and execution of recovery protocols in the event of sensor failure or external disturbances. Visual estimation can provide this redundancy through the use of low-cost sensors and serve as a standalone source of proprioception when no encoder-based sensing is available. Therefore, we estimate the configuration of the robot jointly with its pose, which provides a complete spatial understanding of the observed robot. We present GISR - a method for deep configuration and robot-to-camera pose estimation that prioritizes real-time execution. GISR comprises two modules: (i) a geometric initialization module, efficiently computing an approximate robot pose and configuration, and (ii) an iterative silhouette-based refinement module that refines the initial solution in only a few iterations. We evaluate our method on a publicly available dataset and show that GISR performs competitively with existing state-of-the-art approaches, while being significantly faster compared to existing methods of the same class. Our code is available at this https URL.
https://arxiv.org/abs/2405.04890
The search for refining 3D LiDAR data has attracted growing interest motivated by recent techniques such as supervised learning or generative model-based methods. Existing approaches have shown the possibilities for using diffusion models to generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make it difficult to execute in real-time, causing the approaches to struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to give high performance on image completion tasks. We introduce a series of experiments, including multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. This paper illustrates that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
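The image representation referred to here is typically a spherical range projection of the point cloud. A minimal sketch follows, with field-of-view values assumed from a typical 64-beam sensor rather than taken from the paper:

```python
import numpy as np

def to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """Project a LiDAR point cloud (N, 3) to an (h, w) range image:
    rows index laser inclination, columns index azimuth, pixel value
    is range. FOV bounds (degrees) are assumptions for a 64-beam sensor."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    u = ((0.5 * (1.0 - yaw / np.pi)) * w).astype(int) % w          # azimuth column
    v = np.clip((1.0 - (pitch - fov_down) / (fov_up - fov_down)) * h, 0, h - 1).astype(int)
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r
    return img
```

On this 2D grid, sparse-to-dense upsampling becomes an image-completion problem, which is why conditional inpainting masks from the diffusion literature carry over directly.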
https://arxiv.org/abs/2405.04889
Off-road autonomy validation presents unique challenges due to the unpredictable and dynamic nature of off-road environments. Traditional methods focusing on sequentially sweeping across the parameter space for variability analysis struggle to comprehensively assess the performance and safety of off-road autonomous systems within the imposed time constraints. This paper proposes leveraging scalable digital twin simulations within high-performance computing (HPC) clusters to address this challenge. By harnessing the computational power of HPC clusters, our approach aims to provide a scalable and efficient means to validate off-road autonomy algorithms, enabling rapid iteration and testing of autonomy algorithms under various conditions. We demonstrate the effectiveness of our framework through performance evaluations of the HPC cluster in terms of simulation parallelization and present the systematic variability analysis of a candidate off-road autonomy algorithm to identify potential vulnerabilities in the autonomy stack's perception, planning and control modules.
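The validation pattern, an embarrassingly parallel sweep over the scenario parameter space, can be sketched as below. Threads stand in for what an HPC cluster would distribute as separate scheduler jobs, and the toy safety metric is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_episode(params):
    """Stand-in for one digital-twin simulation episode; in the paper's
    framework this would launch a full off-road autonomy simulation."""
    speed, friction = params
    # toy safety metric: higher speed on low-friction terrain leaves less margin
    return {"speed": speed, "friction": friction, "margin": friction / speed}

def sweep(speeds, frictions, workers=4):
    """Parallel sweep over the scenario parameter grid. Threads model
    what an HPC cluster would run as one scheduler task per instance."""
    grid = list(product(speeds, frictions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_episode, grid))

results = sweep([5.0, 10.0, 15.0], [0.3, 0.6, 0.9], workers=3)
worst = min(results, key=lambda r: r["margin"])   # candidate vulnerability
```

Ranking episodes by the safety metric is the simplest form of the variability analysis described: the worst-performing parameter combinations point at weaknesses in the autonomy stack.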
https://arxiv.org/abs/2405.04743
The applications of artificial intelligence (AI) are rapidly evolving, and they are also commonly used in safety-critical domains, such as autonomous driving and medical diagnosis, where functional safety is paramount. In AI-driven systems, uncertainty estimation allows the user to avoid overconfidence predictions and achieve functional safety. Therefore, the robustness and reliability of model predictions can be improved. However, conventional uncertainty estimation methods, such as the deep ensemble method, impose high computation and, accordingly, hardware (latency and energy) overhead because they require the storage and processing of multiple models. Alternatively, Monte Carlo dropout (MC-dropout) methods, although having low memory overhead, necessitate numerous ($\sim 100$) forward passes, leading to high computational overhead and latency. Thus, these approaches are not suitable for battery-powered edge devices with limited computing and memory resources. In this paper, we propose the Tiny-Deep Ensemble approach, a low-cost approach for uncertainty estimation on edge devices. In our approach, only normalization layers are ensembled $M$ times, with all ensemble members sharing common weights and biases, leading to a significant decrease in storage requirements and latency. Moreover, our approach requires only one forward pass in a hardware architecture that allows batch processing for inference and uncertainty estimation. Furthermore, it has approximately the same memory overhead compared to a single model. Therefore, latency and memory overhead are reduced by a factor of up to $\sim M\times$. Nevertheless, our method does not compromise accuracy, with an increase in inference accuracy of up to $\sim 1\%$ and a reduction in RMSE of $17.17\%$ in various benchmark datasets, tasks, and state-of-the-art architectures.
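The key idea, ensembling only the normalization layers' affine parameters while sharing all other weights, can be sketched in numpy. This is an illustrative reading of the abstract, not the authors' implementation:

```python
import numpy as np

class TinyBatchEnsembleNorm:
    """M affine normalization layers sharing the rest of the network:
    the ensemble dimension lives only in M small (scale, shift) vector
    pairs, so storage grows by M vectors instead of M full models."""
    def __init__(self, m, features, seed=0):
        rng = np.random.default_rng(seed)
        self.gamma = 1.0 + 0.1 * rng.standard_normal((m, features))
        self.beta = 0.01 * rng.standard_normal((m, features))

    def forward(self, x, eps=1e-5):
        # x: (batch, features) activations from the shared backbone
        mu, var = x.mean(0), x.var(0)
        xn = (x - mu) / np.sqrt(var + eps)
        # one batched pass yields all M member outputs: (m, batch, features)
        return self.gamma[:, None, :] * xn[None] + self.beta[:, None, :]

def predictive_mean_and_std(member_outputs):
    """Ensemble mean is the prediction; spread across members is the
    uncertainty estimate."""
    return member_outputs.mean(0), member_outputs.std(0)

x = np.array([[1.0, 2.0], [3.0, 4.0]])          # two samples, two shared features
outputs = TinyBatchEnsembleNorm(m=3, features=2).forward(x)
mean, std = predictive_mean_and_std(outputs)
```

Because the M members differ only in the broadcasted affine parameters, all of them are evaluated in the single forward pass mentioned in the abstract, which is where the up-to-M-fold latency saving comes from.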
https://arxiv.org/abs/2405.05286
This study investigates the metacognitive capabilities of Large Language Models relative to human metacognition in the context of the International Coaching Federation (ICF) mimicking exam, a situational judgment test related to coaching competencies. Using a mixed-methods approach, we assessed the metacognitive performance, including sensitivity, accuracy in probabilistic predictions, and bias, of human participants and five advanced LLMs (GPT-4, Claude 3 Opus, Mistral Large, Llama 3, and Gemini 1.5 Pro). The results indicate that LLMs outperformed humans across all metacognitive metrics, particularly in terms of reduced overconfidence, compared to humans. However, both LLMs and humans showed less adaptability in ambiguous scenarios, adhering closely to predefined decision frameworks. The study suggests that Generative AI can effectively engage in human-like metacognitive processing without conscious awareness. Implications of the study are discussed in relation to development of AI simulators that scaffold cognitive and metacognitive aspects of mastering coaching competencies. More broadly, implications of these results are discussed in relation to development of metacognitive modules that lead towards more autonomous and intuitive AI systems.
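The metacognitive metrics mentioned, probabilistic accuracy and overconfidence bias, can be computed along the following lines; the exact formulas used in the study may differ:

```python
import numpy as np

def metacognitive_metrics(confidences, correct):
    """Simple calibration-style metrics: the Brier score on probabilistic
    predictions (lower is better) and an overconfidence bias, defined here
    as mean stated confidence minus actual accuracy (> 0 = overconfident)."""
    confidences = np.asarray(confidences, dtype=float)  # in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1 if answer right
    brier = np.mean((confidences - correct) ** 2)
    bias = confidences.mean() - correct.mean()
    return {"brier": brier, "overconfidence": bias}
```

For example, a respondent who is fully confident on every item but right only three times out of four scores a Brier of 0.25 and an overconfidence of +0.25; the study's finding is that the LLMs sit closer to zero bias than the human participants.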
https://arxiv.org/abs/2405.05285
Routing protocols help in transmitting the sensed data from UAVs monitoring the targets (called target UAVs) to the BS. However, the highly dynamic nature of an autonomous, decentralized UAV network leads to frequent route breaks or traffic disruptions. Traditional routing schemes cannot quickly adapt to dynamic UAV networks and/or incur large control overhead and delays. To establish stable, high-quality routes from target UAVs to the BS, we design a hybrid reactive routing scheme called pipe routing that is mobility, congestion, and energy-aware. The pipe routing scheme discovers routes on-demand and proactively switches to alternate high-quality routes within a limited region around the active routes (called the pipe) when needed, reducing the number of route breaks and increasing data throughput. We then design a novel topology control-based pipe routing scheme to maintain robust connectivity in the pipe region around the active routes, leading to improved route stability and increased throughput with minimal impact on the coverage performance of the UAV network.
https://arxiv.org/abs/2405.04678