Autonomous vehicles are gradually entering city roads today, with the help of high-definition maps (HDMaps). However, the reliance on HDMaps prevents autonomous vehicles from stepping into regions without this expensive digital infrastructure. This fact drives many researchers to study online HDMap generation algorithms, but the performance of these algorithms in distant regions is still unsatisfactory. We present P-MapNet, in which the letter P highlights the fact that we focus on incorporating map priors to improve model performance. Specifically, we exploit priors in both SDMap and HDMap. On one hand, we extract weakly aligned SDMap from OpenStreetMap and encode it as an additional conditioning branch. Despite the misalignment challenge, our attention-based architecture adaptively attends to relevant SDMap skeletons and significantly improves performance. On the other hand, we exploit a masked autoencoder to capture the prior distribution of HDMap, which can serve as a refinement module to mitigate occlusions and artifacts. We benchmark on the nuScenes and Argoverse2 datasets. Through comprehensive experiments, we show that: (1) our SDMap prior can improve online map generation performance, using both rasterized (by up to $+18.73$ $\rm mIoU$) and vectorized (by up to $+8.50$ $\rm mAP$) output representations; (2) our HDMap prior can improve map perceptual metrics by up to $6.34\%$; (3) P-MapNet can be switched into different inference modes that cover different regions of the accuracy-efficiency trade-off landscape; (4) P-MapNet is a far-seeing solution that brings larger improvements at longer ranges. Codes and models are publicly available at this https URL.
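To make the SDMap-prior idea concrete, below is a minimal PyTorch sketch of conditioning BEV features on SDMap tokens via cross-attention, in the spirit of the paper's attention-based branch. The module name, shapes, and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: each BEV cell (query) attends to encoded SDMap skeleton
# tokens (keys/values), so relevant but misaligned road priors can still
# be picked up without assuming pixel-level alignment.
import torch
import torch.nn as nn

class SDMapCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_tokens, sdmap_tokens):
        # bev_tokens:   (B, H*W, C) flattened BEV features
        # sdmap_tokens: (B, N, C)   encoded SDMap skeleton
        fused, _ = self.attn(bev_tokens, sdmap_tokens, sdmap_tokens)
        return self.norm(bev_tokens + fused)  # residual fusion

bev = torch.randn(2, 50 * 100, 256)
sd = torch.randn(2, 128, 256)
out = SDMapCrossAttention()(bev, sd)          # (2, 5000, 256)
```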
https://arxiv.org/abs/2403.10521
Integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models' cognitive capabilities and the understanding of physical embodiment leads to the following question: Can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots that do not rely on external sensors or pre-programmed knowledge about their structure. Here, we propose a meta-self-modeling approach that can deduce robot morphology through proprioception (the internal sense of position and movement). Our study introduces a 12-DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance robots' understanding of their physical embodiment and adaptability in real-world scenarios.
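As a rough illustration of the encoder-decoder idea, the sketch below encodes a proprioceptive sequence into a "robot signature" and decodes it into per-module configuration logits. The GRU encoder, all dimensions, and the per-module categorical head are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch: a signature encoder summarizes joint positions/velocities
# over time; a configuration decoder classifies the morphology.
import torch
import torch.nn as nn

class MorphologyPredictor(nn.Module):
    def __init__(self, proprio_dim=24, hidden=128, n_modules=4, n_choices=10):
        super().__init__()
        self.encoder = nn.GRU(proprio_dim, hidden, batch_first=True)
        # one categorical head per reconfigurable module (assumed layout)
        self.decoder = nn.Linear(hidden, n_modules * n_choices)
        self.n_modules, self.n_choices = n_modules, n_choices

    def forward(self, proprio_seq):
        # proprio_seq: (B, T, proprio_dim), e.g. 12 joint angles + velocities
        _, h = self.encoder(proprio_seq)        # h: (1, B, hidden) = signature
        logits = self.decoder(h.squeeze(0))
        return logits.view(-1, self.n_modules, self.n_choices)

model = MorphologyPredictor()
logits = model(torch.randn(8, 100, 24))         # (8, 4, 10)
pred_config = logits.argmax(dim=-1)             # predicted configuration ids
```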
https://arxiv.org/abs/2403.10496
The surge in black-box AI models has prompted the need to explain their internal mechanisms and justify their reliability, especially in high-stakes applications such as healthcare and autonomous driving. Due to the lack of a rigorous definition of explainable AI (XAI), a plethora of research related to explainability, interpretability, and transparency has been developed to explain and analyze models from various perspectives. Consequently, with an exhaustive list of papers, it becomes challenging to have a comprehensive overview of XAI research from all aspects. Considering the popularity of neural networks in AI research, we narrow our focus to a specific area of XAI research: gradient-based explanations, which can be directly adopted for neural network models. In this review, we systematically explore gradient-based explanation methods to date and introduce a novel taxonomy to categorize them into four distinct classes. Then, we present the essence of the technical details in chronological order and underscore the evolution of the algorithms. Next, we introduce both human and quantitative evaluations to measure algorithm performance. More importantly, we demonstrate the general challenges in XAI and the specific challenges of gradient-based explanations. We hope that this survey can help researchers understand state-of-the-art methods and their corresponding disadvantages, which could spark their interest in addressing these issues in future work.
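For readers new to the family being surveyed, here is the simplest gradient-based explanation, vanilla gradient saliency: the attribution is the magnitude of the class score's gradient with respect to the input. The toy CNN is a stand-in for any differentiable model.

```python
# Vanilla-gradient saliency: explanation = |d score / d input|.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

x = torch.randn(1, 3, 32, 32, requires_grad=True)
score = model(x)[0, 3]                 # logit of the class being explained
score.backward()                       # backpropagate to the input
saliency = x.grad.abs().max(dim=1)[0]  # (1, 32, 32) per-pixel attribution
```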
https://arxiv.org/abs/2403.10415
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require lots of trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper location for placing the multi-head self-attention module. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models obtained via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers across branches of different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
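The multi-objective selection step can be illustrated independently of the supernet: given candidates scored by latency (lower is better) and mIoU (higher is better), keep the non-dominated set. The candidates below are synthetic; this is only a sketch of Pareto-frontier filtering, not the HyCTAS search itself.

```python
# Keep architectures not dominated in both latency and mIoU.
def pareto_front(candidates):
    # candidates: list of (name, latency_ms, miou)
    front = []
    for c in candidates:
        dominated = any(o[1] <= c[1] and o[2] >= c[2] and o != c
                        for o in candidates)
        if not dominated:
            front.append(c)
    return front

cands = [("a", 12.0, 0.71), ("b", 9.0, 0.69),
         ("c", 15.0, 0.72), ("d", 13.0, 0.70)]
print(pareto_front(cands))  # "d" is dominated by "a" (faster and more accurate)
```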
https://arxiv.org/abs/2403.10413
Accurate positioning of underwater robots in confined environments is crucial for inspection and mapping tasks and is also a prerequisite for autonomous operations. Presently, no available positioning system is suited for real-world use in confined underwater environments, works regardless of environmental lighting and water turbidity levels, and offers sufficient accuracy for reliable and repeatable navigation. This shortage presents a significant barrier to enhancing the capabilities of ROVs in such scenarios. This paper introduces an innovative positioning system for ROVs operating in confined, cluttered underwater settings, achieved through the collaboration of an omnidirectional surface vehicle and an ROV. A formulation is proposed and evaluated in simulation against ground truth. The experimental results from the simulation form a proof of principle of the proposed system and also demonstrate its deployability. Unlike many previous approaches, the system does not rely on fixed infrastructure or tracking of features in the environment, and can cover large enclosed areas without additional equipment.
https://arxiv.org/abs/2403.10397
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: this https URL.
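A shape-level sketch of the cyclic 3D-2D-3D interaction follows: 3D queries are allocated to the 2D side, refined by a 2D decoder layer over image tokens, and aggregated back into the 3D queries. SimPB's actual allocation uses geometric projection per camera; the learned allocation and all dimensions here are simplifying assumptions.

```python
# Hedged sketch of one cyclic 3D-2D-3D round (toy, shape-level only).
import torch
import torch.nn as nn

class Cyclic3D2D3D(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.to_2d = nn.Linear(dim, dim)  # stand-in for dynamic query allocation
        self.dec2d = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.agg = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q3d, img_feats):
        # q3d: (B, Q, C) 3D queries; img_feats: (B, N, C) multi-camera tokens
        q2d = self.to_2d(q3d)             # allocate 3D queries to the 2D task
        q2d = self.dec2d(q2d, img_feats)  # refine with a 2D decoder layer
        out, _ = self.agg(q3d, q2d, q2d)  # adaptive aggregation back to 3D
        return q3d + out

q3d = torch.randn(2, 300, 256)
feats = torch.randn(2, 6 * 500, 256)      # e.g. 6 cameras, 500 tokens each
refined = Cyclic3D2D3D()(q3d, feats)      # (2, 300, 256)
```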
https://arxiv.org/abs/2403.10353
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a stream of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single-domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets, and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by jointly learning a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
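The cross-domain class vocabulary can be pictured as a growing map from class names to text embeddings, queried zero-shot by cosine similarity, CLIP-style. The sketch below uses random embeddings as stand-ins for real CLIP text features and omits the learned task prompts entirely.

```python
# Hedged sketch: vocabulary grows across tasks; classification is
# nearest text embedding under cosine similarity.
import torch
import torch.nn.functional as F

vocab = {}                                   # class name -> text embedding

def add_task(class_names, text_encoder):
    for name in class_names:
        vocab.setdefault(name, text_encoder(name))  # keep existing entries

def classify(image_feat):
    names = list(vocab)
    text = F.normalize(torch.stack([vocab[n] for n in names]), dim=-1)
    sims = F.normalize(image_feat, dim=-1) @ text.T
    return names[sims.argmax()]

fake_encoder = lambda name: torch.randn(512)  # stand-in for CLIP text features
add_task(["car", "dog"], fake_encoder)        # task 1
add_task(["dog", "boat"], fake_encoder)       # task 2: "dog" reused, not overwritten
print(classify(torch.randn(512)))
```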
https://arxiv.org/abs/2403.10245
In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and sophisticated reasoning. This development heralds a new era of scalability and human-like adaptability in goal attainment. In this context, we introduce AUTONODE (Autonomous User-interface Transformation through Online Neuro-graphic Operations and Deep Exploration). AUTONODE employs advanced neuro-graphical techniques to facilitate autonomous navigation and task execution on web interfaces, thereby obviating the necessity for predefined scripts or manual intervention. Our engine empowers agents to comprehend and implement complex workflows, adapting to dynamic web environments with unparalleled efficiency. Our methodology synergizes cognitive functionalities with robotic automation, endowing AUTONODE with the ability to learn from experience. We have integrated an exploratory module, DoRA (Discovery and mapping Operation for graph Retrieval Agent), which is instrumental in constructing a knowledge graph that the engine utilizes to optimize its actions and achieve objectives with minimal supervision. The versatility and efficacy of AUTONODE are demonstrated through a series of experiments, highlighting its proficiency in managing a diverse array of web-based tasks, ranging from data extraction to transaction processing.
https://arxiv.org/abs/2403.10171
The value of roadside perception, which could extend the boundaries of autonomous driving and traffic management, has gradually become more prominent and acknowledged in recent years. However, existing roadside perception approaches focus only on the single-infrastructure sensor system, which cannot realize a comprehensive understanding of a traffic area because of its limited sensing range and blind spots. Toward high-quality roadside perception, we need Roadside Cooperative Perception (RCooper) to achieve practical area-coverage roadside perception for restricted traffic areas. RCooper has its own domain-specific challenges, but further exploration is hindered by the lack of datasets. We hence release the first real-world, large-scale RCooper dataset to spur research on practical roadside cooperative perception, including detection and tracking. The manually annotated dataset comprises 50k images and 30k point clouds, covering two representative traffic scenes (i.e., intersection and corridor). The constructed benchmarks prove the effectiveness of roadside cooperative perception and demonstrate directions for further research. Codes and dataset can be accessed at: this https URL.
https://arxiv.org/abs/2403.10145
This paper presents a proof-of-concept study that examines the utilization of generative AI and mobile robotics for autonomous laboratory monitoring in the pharmaceutical R&D laboratory. The study investigates the potential advantages of anomaly detection and automated reporting by a multi-modal model and a Vision Foundation Model (VFM), which have the potential to enhance compliance and safety in laboratory environments. Additionally, the paper discusses the current limitations of the generative AI approach and proposes future directions for its application in lab monitoring.
https://arxiv.org/abs/2403.10108
Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems. Within it, Video-based Human-Object Interaction (V-HOI) detection is a crucial task in semantic scene understanding, aimed at comprehensively understanding HOI relationships within a video to inform the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in accurate detection on specific datasets, they still lack the human-like general reasoning ability needed to effectively infer HOI relationships. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that improve the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme in which the LLMs conduct reasoning from different aspects. In the second stage, we perform a Multi-LLMs Debate to obtain the final reasoning answer based on the complementary knowledge of the different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model, to enhance the base V-HOI models' discriminative ability to better cooperate with LLMs. We validate the superiority of our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.
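The two-stage collaboration can be sketched as a plain debate loop: independent answers first, then rounds in which each agent sees the others' answers, with a majority vote at the end. The llm callables are placeholders, not a real API, and the prompting is deliberately simplistic.

```python
# Hedged sketch: stage 1 = independent cross-agent reasoning,
# stage 2 = debate rounds + majority vote.
from collections import Counter

def multi_llm_debate(question, agents, rounds=2):
    answers = {name: llm(question) for name, llm in agents.items()}  # stage 1
    for _ in range(rounds):                                          # stage 2
        context = "; ".join(f"{n}: {a}" for n, a in answers.items())
        answers = {n: llm(f"{question}\nOthers said: {context}")
                   for n, llm in agents.items()}
    return Counter(answers.values()).most_common(1)[0][0]            # majority

agents = {"spatial": lambda q: "holding",
          "temporal": lambda q: "holding",
          "commonsense": lambda q: "lifting"}
print(multi_llm_debate("Is the person holding or lifting the box?", agents))
```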
https://arxiv.org/abs/2403.10107
Recent research on mobile robot navigation has focused on socially aware navigation in crowded environments. However, existing methods do not adequately account for human-robot interactions and demand accurate location information from omnidirectional sensors, rendering them unsuitable for practical applications. In response to this need, this study introduces a novel algorithm, BNBRL+, predicated on the partially observable Markov decision process framework to assess risks in unobservable areas and formulate movement strategies under uncertainty. BNBRL+ consolidates belief algorithms with Bayesian neural networks to probabilistically infer beliefs based on the positional data of humans. It further integrates the dynamics between the robot, humans, and inferred beliefs to determine the navigation paths, and embeds social norms within the reward function, thereby facilitating socially aware navigation. Through experiments in various risk-laden scenarios, this study validates the effectiveness of BNBRL+ in navigating crowded environments with blind spots. The model's ability to navigate effectively in spaces with limited visibility and avoid obstacles dynamically can significantly improve the safety and reliability of autonomous vehicles.
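The belief-inference step can be approximated with Monte-Carlo dropout, a common practical stand-in for a Bayesian neural network: repeated stochastic forward passes yield a mean position and an epistemic spread. The architecture and observation layout below are assumptions, not the BNBRL+ network.

```python
# Hedged sketch: MC-dropout belief over a human's position from a
# partial observation; high spread flags blind-spot risk for planning.
import torch
import torch.nn as nn

class BeliefNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                                 nn.Dropout(0.2), nn.Linear(64, 2))

    def belief(self, obs, samples=50):
        self.train()  # keep dropout active at inference (MC dropout)
        preds = torch.stack([self.net(obs) for _ in range(samples)])
        return preds.mean(0), preds.std(0)  # mean position, epistemic spread

mean, std = BeliefNet().belief(torch.randn(4))
# large std => unobservable region, treated as risky by the policy
```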
https://arxiv.org/abs/2403.10105
Autonomous driving demands high-quality LiDAR data, yet the cost of physical LiDAR sensors presents a significant scaling-up challenge. While recent efforts have explored deep generative models to address this issue, they often consume substantial computational resources with slow generation speeds while suffering from a lack of realism. To address these limitations, we introduce RangeLDM, a novel approach for rapidly generating high-quality range-view LiDAR point clouds via latent diffusion models. We achieve this by correcting range-view data distribution for accurate projection from point clouds to range images via Hough voting, which has a critical impact on generative learning. We then compress the range images into a latent space with a variational autoencoder, and leverage a diffusion model to enhance expressivity. Additionally, we instruct the model to preserve 3D structural fidelity by devising a range-guided discriminator. Experimental results on KITTI-360 and nuScenes datasets demonstrate both the robust expressiveness and fast speed of our LiDAR point cloud generation.
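A minimal sketch of the latent-diffusion stage follows: a range image is encoded to a latent, noised at a random timestep, and a denoiser is trained to predict that noise (DDPM-style epsilon loss). The stand-in encoder and denoiser, and the omission of timestep conditioning, Hough-voting projection, and the range-guided discriminator, are all simplifications.

```python
# Hedged sketch of one diffusion training step in range-image latent space.
import torch
import torch.nn as nn

enc = nn.Conv2d(1, 4, 8, stride=8)        # stand-in VAE encoder: 1x64x1024 -> 4x8x128
denoiser = nn.Conv2d(4, 4, 3, padding=1)  # stand-in noise predictor (a real one
                                          # is also conditioned on the timestep t)
betas = torch.linspace(1e-4, 0.02, 1000)
abar = torch.cumprod(1 - betas, dim=0)    # cumulative noise schedule

range_img = torch.rand(2, 1, 64, 1024)    # normalized range-view LiDAR image
z = enc(range_img)
t = torch.randint(0, 1000, (1,))
noise = torch.randn_like(z)
z_t = abar[t].sqrt() * z + (1 - abar[t]).sqrt() * noise  # forward diffusion
loss = ((denoiser(z_t) - noise) ** 2).mean()             # epsilon-prediction loss
loss.backward()
```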
https://arxiv.org/abs/2403.10094
Multi-agent perception (MAP) allows autonomous systems to understand complex environments by interpreting data from multiple sources. This paper investigates intermediate collaboration for MAP with a specific focus on exploring "good" properties of collaborative view (i.e., post-collaboration feature) and its underlying relationship to individual views (i.e., pre-collaboration features), which were treated as an opaque procedure by most existing works. We propose a novel framework named CMiMC (Contrastive Mutual Information Maximization for Collaborative Perception) for intermediate collaboration. The core philosophy of CMiMC is to preserve discriminative information of individual views in the collaborative view by maximizing mutual information between pre- and post-collaboration features while enhancing the efficacy of collaborative views by minimizing the loss function of downstream tasks. In particular, we define multi-view mutual information (MVMI) for intermediate collaboration that evaluates correlations between collaborative views and individual views on both global and local scales. We establish CMiMNet based on multi-view contrastive learning to realize estimation and maximization of MVMI, which assists the training of a collaboration encoder for voxel-level feature fusion. We evaluate CMiMC on V2X-Sim 1.0, and it improves the SOTA average precision by 3.08% and 4.44% at 0.5 and 0.7 IoU (Intersection-over-Union) thresholds, respectively. In addition, CMiMC can reduce communication volume to 1/32 while achieving performance comparable to SOTA. Code and Appendix are released at this https URL.
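The mutual-information objective can be instantiated, at a single scale, as an InfoNCE bound between pre- and post-collaboration features, with matching agents as positives and other batch entries as negatives. The multi-scale MVMI of the paper is reduced here to one pooled scale, and all shapes are assumptions.

```python
# Hedged sketch: InfoNCE between pre- and post-collaboration features.
import torch
import torch.nn.functional as F

def info_nce(pre, post, tau=0.1):
    # pre, post: (B, C) pooled features of the same agent before/after fusion
    pre, post = F.normalize(pre, dim=-1), F.normalize(post, dim=-1)
    logits = pre @ post.T / tau           # (B, B) similarity matrix
    targets = torch.arange(pre.size(0))   # diagonal entries are positives
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
# maximizing MI ~ minimizing this loss alongside the downstream task loss
```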
https://arxiv.org/abs/2403.10068
Cross-technology communication (CTC) enables seamless interactions between diverse wireless technologies. Most existing work is based on reversing the transmission path to identify the appropriate payload to generate the waveform that the target devices can recognize. However, this method suffers from many limitations, including dependency on specific technologies and the necessity for intricate algorithms to mitigate distortion. In this work, we present NNCTC, a Neural-Network-based Cross-Technology Communication framework inspired by the adaptability of trainable neural models in wireless communications. By converting signal processing components within the CTC pipeline into neural models, NNCTC is designed for end-to-end training without requiring labeled data. This enables the NNCTC system to autonomously derive the optimal CTC payload, which significantly eases the development complexity and showcases the scalability potential for various CTC links. In particular, we construct a CTC system from Wi-Fi to ZigBee. The NNCTC system outperforms the well-recognized WEBee and WIDE designs in error performance, achieving an average packet reception rate (PRR) of 92.3% and an average symbol error rate (SER) as low as 1.3%.
https://arxiv.org/abs/2403.10014
Unsupervised domain adaptation (UDA) is vital for alleviating the workload of labeling 3D point cloud data and mitigating the absence of labels when facing a newly defined domain. Various methods of utilizing images to enhance the performance of cross-domain 3D segmentation have recently emerged. However, the pseudo labels, which are generated from models trained on the source domain and provide additional supervised signals for the unseen domain, are inadequate when utilized for 3D segmentation due to their inherent noisiness, and consequently restrict the accuracy of neural networks. With the advent of 2D visual foundation models (VFMs) and their abundant knowledge priors, we propose a novel pipeline, VFMSeg, to further enhance the cross-modal unsupervised domain adaptation framework by leveraging these models. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first utilize a multi-modal VFM, which is pre-trained on large-scale image-text pairs, to provide supervised labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM trained on fine-grained 2D masks is adopted to guide the generation of semantically augmented images and point clouds, which mix data from the source and target domains as view frustums (FrustumMixing), to enhance the performance of neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets, and the results demonstrate a significant improvement for the 3D segmentation task.
https://arxiv.org/abs/2403.10001
Modern autonomous systems, such as flying, legged, and wheeled robots, are generally characterized by high-dimensional nonlinear dynamics, which presents challenges for model-based safety-critical control design. Motivated by the success of reduced-order models in robotics, this paper presents a tutorial on constructive safety-critical control via reduced-order models and control barrier functions (CBFs). To this end, we provide a unified formulation of techniques in the literature that share a common foundation of constructing CBFs for complex systems from CBFs for much simpler systems. Such ideas are illustrated through formal results, simple numerical examples, and case studies of real-world systems to which these techniques have been experimentally applied.
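As a hedged recap of the flavor of construction the tutorial unifies (notation here is illustrative, not lifted from the paper): a CBF $h$ renders the safe set $S = \{x : h(x) \ge 0\}$ forward invariant when $\sup_u \dot{h}(x,u) \ge -\alpha(h(x))$, and one way to lift a reduced-order CBF $h_0$ with safe velocity command $k_0$ to a full-order system is a backstepping-style penalty on the velocity tracking error:

```latex
% Illustrative backstepping-style construction (assumed notation):
% h_0 : CBF for the reduced-order model (e.g., a single integrator),
% k_0 : safe velocity command derived from h_0, mu > 0 a tunable gain.
\[
  h(q, v) \;=\; h_0(q) \;-\; \frac{1}{2\mu}\,\bigl\| v - k_0(q) \bigr\|^{2}
\]
```

Enforcing the CBF condition on this composite $h$ keeps the full-order state safe while only requiring a barrier designed for the much simpler model.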
https://arxiv.org/abs/2403.09865
Tasks such as autonomous navigation, 3D reconstruction, and object recognition near the water surface are crucial in marine robotics applications. However, challenges arise due to dynamic disturbances, e.g., light reflections and refraction from the random air-water interface, irregular liquid flow, and similar factors, which can lead to potential failures in perception and navigation systems. Traditional computer vision algorithms struggle to differentiate between real and virtual image regions, significantly complicating these tasks. A virtual image region is an apparent representation formed by the redirection of light rays, typically through reflection or refraction, creating the illusion of an object's presence without its actual physical location. This work proposes a novel approach for segmentation of real and virtual image regions, exploiting synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency. Our segmentation network does not need to be re-trained if the domain changes. We show this by deploying the same segmentation network in two different domains: simulation and the real world. By creating realistic synthetic images that mimic the complexities of the water surface, we provide fine-grained training data for our network (MARVIS) to discern between real and virtual images effectively. By motion- and geometry-aware design choices and through comprehensive experimental analysis, we achieve state-of-the-art real-virtual image segmentation performance in an unseen real-world domain, achieving an IoU over 78% and an F1-Score over 86% while ensuring a small computational footprint. MARVIS offers inference rates of over 43 FPS on a single GPU (8 FPS on a CPU core). Our code and dataset are available at this https URL.
https://arxiv.org/abs/2403.09850
In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner, surpassing general or driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
https://arxiv.org/abs/2403.09630
Forestry constitutes a key element for a sustainable future, yet it is supremely challenging to introduce digital processes to improve efficiency. The main limitation is the difficulty of obtaining accurate maps at high temporal and spatial resolution as a basis for informed forestry decision-making, due to the vast area forests extend over and the sheer number of trees. To address this challenge, we present an autonomous Micro Aerial Vehicle (MAV) system which relies purely on cost-effective and lightweight passive visual and inertial sensors to perform under-canopy autonomous navigation. We leverage visual-inertial simultaneous localization and mapping (VI-SLAM) for accurate MAV state estimates and couple it with a volumetric occupancy submapping system to achieve a scalable mapping framework which can be directly used for path planning. As opposed to a monolithic map, submaps inherently deal with the inevitable drift and corrections from VI-SLAM, since they move with the pose estimates as these are updated. To ensure the safety of the MAV during navigation, we also propose a novel reference trajectory anchoring scheme that moves and deforms the reference trajectory the MAV is tracking upon state updates from the VI-SLAM system in a consistent way, even upon large changes in state estimates due to loop closures. We thoroughly validate our system in both real and simulated forest environments with high tree densities in excess of 400 trees per hectare and at speeds of up to 3 m/s, without encountering a single collision or system failure. To the best of our knowledge, this is the first system which achieves this level of performance in such unstructured environments using low-cost passive visual sensors and fully on-board computation, including VI-SLAM.
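The drift-tolerance of submaps is easy to see in a minimal 2D sketch: each submap stores points in its own frame plus an anchor pose, so a VI-SLAM correction (e.g., after a loop closure) only replaces anchors and the world-frame map follows consistently. Class and function names below are illustrative, not from the authors' code.

```python
# Hedged sketch: re-anchoring a submap after a pose correction.
import numpy as np

class Submap:
    def __init__(self, anchor_pose, points_local):
        self.anchor = anchor_pose   # 3x3 SE(2) homogeneous transform
        self.points = points_local  # (N, 2) points in the submap frame

    def points_world(self):
        p = np.c_[self.points, np.ones(len(self.points))]
        return (self.anchor @ p.T).T[:, :2]

def se2(x, y, th):
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, -s, x], [s, c, y], [0, 0, 1]])

sm = Submap(se2(1.0, 0.0, 0.0), np.array([[0.5, 0.0], [1.0, 0.0]]))
sm.anchor = se2(1.2, 0.1, 0.05)   # loop-closure correction: re-anchor only
print(sm.points_world())          # map moves consistently, no re-integration
```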
https://arxiv.org/abs/2403.09596