Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLMs incurs expensive training costs, typically hundreds of A100 GPU-hours, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which leverages the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize the parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks at significantly lower training cost. Notably, MiniGPT-3D gains an 8.12-point increase in GPT-4 evaluation score on the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800 GPUs. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at this https URL.
https://arxiv.org/abs/2405.01413
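The abstract above does not specify how the mixture of query experts is built, so the following is only a minimal PyTorch sketch of one plausible reading: several sets of learnable queries cross-attend to point-cloud features and a lightweight router mixes their outputs into query tokens for the LLM. All module names, dimensions, and the routing scheme are illustrative assumptions, not MiniGPT-3D's actual design.

```python
# Hypothetical sketch of a "mixture of query experts" aggregator (dims/names are assumptions).
import torch
import torch.nn as nn

class MixtureOfQueryExperts(nn.Module):
    def __init__(self, dim=384, num_experts=4, queries_per_expert=32, heads=8):
        super().__init__()
        # Each expert owns its own set of learnable queries.
        self.expert_queries = nn.Parameter(
            torch.randn(num_experts, queries_per_expert, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Router scores each expert from the pooled point-cloud feature.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, point_feats):                                           # (B, N, dim)
        B = point_feats.size(0)
        gates = torch.softmax(self.router(point_feats.mean(dim=1)), dim=-1)  # (B, E)
        expert_outputs = []
        for e in range(self.expert_queries.size(0)):
            q = self.expert_queries[e].unsqueeze(0).expand(B, -1, -1)        # (B, Q, dim)
            out, _ = self.attn(q, point_feats, point_feats)                   # cross-attention
            expert_outputs.append(out)
        stacked = torch.stack(expert_outputs, dim=1)                          # (B, E, Q, dim)
        # Gate-weighted sum over experts -> aggregated query tokens for the LLM.
        return (gates[:, :, None, None] * stacked).sum(dim=1)                 # (B, Q, dim)

tokens = MixtureOfQueryExperts()(torch.randn(2, 1024, 384))
print(tokens.shape)  # torch.Size([2, 32, 384])
```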
As human-machine interaction continues to evolve, the capacity for environmental perception is becoming increasingly crucial. Integrating the two most common types of sensory data, images, and point clouds, can enhance detection accuracy. However, currently, no model exists that can simultaneously detect an object's position in both point clouds and images and ascertain their corresponding relationship. This information is invaluable for human-machine interactions, offering new possibilities for their enhancement. In light of this, this paper introduces an end-to-end Consistency Object Detection (COD) algorithm framework that requires only a single forward inference to simultaneously obtain an object's position in both point clouds and images and establish their correlation. Furthermore, to assess the accuracy of the object correlation between point clouds and images, this paper proposes a new evaluation metric, Consistency Precision (CP). To verify the effectiveness of the proposed framework, an extensive set of experiments has been conducted on the KITTI and DAIR-V2X datasets. The study also explored how the proposed consistency detection method performs on images when the calibration parameters between images and point clouds are disturbed, compared to existing post-processing methods. The experimental results demonstrate that the proposed method exhibits excellent detection performance and robustness, achieving end-to-end consistency detection. The source code will be made publicly available at this https URL.
https://arxiv.org/abs/2405.01258
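The abstract introduces Consistency Precision (CP) without giving its formula, so the snippet below is only one plausible operationalization for illustration: treat the model's image/point-cloud correspondences as predicted pairs and measure the fraction that agree with ground-truth correspondences. The function name and pair representation are assumptions.

```python
# Illustrative sketch only: one plausible way to score image/point-cloud association
# consistency; not the paper's exact CP definition.
def consistency_precision(pred_pairs, gt_pairs):
    """pred_pairs / gt_pairs: sets of (image_det_id, pointcloud_det_id) tuples that the
    model / ground truth declare to be the same physical object."""
    pred_pairs, gt_pairs = set(pred_pairs), set(gt_pairs)
    if not pred_pairs:
        return 0.0
    return len(pred_pairs & gt_pairs) / len(pred_pairs)

# Example: 3 predicted associations, 2 of them agree with ground truth -> 0.667.
print(consistency_precision({(0, 0), (1, 2), (2, 1)}, {(0, 0), (1, 1), (2, 1)}))
```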
Sports analysis and viewing play a pivotal role in the current sports domain, offering significant value not only to coaches and athletes but also to fans and the media. In recent years, the rapid development of virtual reality (VR) and augmented reality (AR) technologies has introduced a new platform for watching games. Visualization of sports competitions in VR/AR represents a revolutionary technology, providing audiences with a novel immersive viewing experience. However, there is still a lack of related research in this area. In this work, we present for the first time a comprehensive system for sports competition analysis and real-time visualization on VR/AR platforms. First, we utilize multiview LiDARs and cameras to collect multimodal game data. Subsequently, we propose a framework for multi-player tracking and pose estimation based on a limited amount of supervised data, which extracts precise player positions and movements from point clouds and images. Moreover, we perform avatar modeling of players to obtain their 3D models. Ultimately, using these 3D player data, we conduct competition analysis and real-time visualization on VR/AR. Extensive quantitative experiments demonstrate the accuracy and robustness of our multi-player tracking and pose estimation framework. The visualization results showcase the immense potential of our sports visualization system for watching games on VR/AR devices. The multimodal competition dataset we collected and all related code will be released soon.
https://arxiv.org/abs/2405.01112
Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying the stronger geometric information offered by the explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.
https://arxiv.org/abs/2405.00900
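The paper's occlusion-aware depth supervision scheme is only named in the abstract, so the following PyTorch sketch shows the general idea under stated assumptions: accumulate several LiDAR sweeps projected into a training view, keep only the nearest valid depth per pixel as a crude occlusion handling step, and penalize the NeRF-rendered depth against it. Function name, inputs, and the L1 penalty are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of an occlusion-aware depth supervision term (assumed formulation).
import torch

def occlusion_aware_depth_loss(rendered_depth, lidar_depth_maps, valid_masks):
    """rendered_depth: (H, W) NeRF-rendered depth for one training view.
    lidar_depth_maps: (K, H, W) depths from K accumulated LiDAR sweeps projected
    into that view; valid_masks: (K, H, W) booleans marking pixels hit by a point."""
    depths = torch.where(valid_masks, lidar_depth_maps,
                         torch.full_like(lidar_depth_maps, float('inf')))
    nearest, _ = depths.min(dim=0)          # per-pixel closest surface across sweeps
    supervised = torch.isfinite(nearest)    # pixels with at least one valid LiDAR return
    return torch.abs(rendered_depth[supervised] - nearest[supervised]).mean()

loss = occlusion_aware_depth_loss(
    torch.rand(64, 64) * 50.0,
    torch.rand(5, 64, 64) * 50.0,
    torch.rand(5, 64, 64) > 0.7,
)
print(loss.item())
```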
In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp-Policies achieve grasping success rates of 97.14% for single objects and 87.43% and 83.33% for mixed clutter scenarios of Levels 1-2 and Levels 3-4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target picking, pushing the boundaries of intelligent robotic systems.
https://arxiv.org/abs/2405.00841
In recent years, zero-shot learning has attracted the focus of many researchers due to its flexibility and generality. Many approaches have been proposed to achieve zero-shot classification of point clouds for 3D object understanding, following the scheme of CLIP. However, in the real world, the point clouds could be extremely sparse, dramatically limiting the effectiveness of the 3D point cloud encoders and resulting in misalignment between point cloud features and text embeddings. To allow the point cloud encoders to fit extremely sparse point clouds without re-running the pre-training procedure, which could be time-consuming and expensive, in this work we propose an unsupervised model adaptation approach to enhance the point cloud encoder for extremely sparse point clouds. We propose a novel fused-cross attention layer that expands the pre-trained self-attention layer with additional learnable tokens and attention blocks, which effectively modifies the point cloud features while maintaining the alignment between point cloud features and text embeddings. We also propose a complementary learning-based self-distillation scheme that encourages the modified features to be pulled apart from the irrelevant text embeddings without overfitting the feature space to the observed text embeddings. Extensive experiments demonstrate that the proposed approach effectively increases the zero-shot capability on extremely sparse point clouds and outperforms other state-of-the-art model adaptation approaches.
https://arxiv.org/abs/2404.19639
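The exact design of the fused-cross attention layer is not given in the abstract; the sketch below illustrates the general mechanism under stated assumptions: a frozen pre-trained attention layer is reused, and a small set of learnable tokens is concatenated to its keys/values so point-cloud features can be adapted without re-running pre-training. Layer names, dimensions, and the residual connection are illustrative.

```python
# Hedged sketch of adapting a frozen attention layer with extra learnable tokens.
import torch
import torch.nn as nn

class FusedCrossAttention(nn.Module):
    def __init__(self, pretrained_attn: nn.MultiheadAttention, dim=256, num_extra_tokens=16):
        super().__init__()
        self.attn = pretrained_attn                  # frozen weights from the 3D encoder
        for p in self.attn.parameters():
            p.requires_grad_(False)
        self.extra_tokens = nn.Parameter(torch.randn(num_extra_tokens, dim) * 0.02)

    def forward(self, x):                            # x: (B, N, dim) point tokens
        extra = self.extra_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([x, extra], dim=1)            # keys/values = original + learnable tokens
        out, _ = self.attn(x, kv, kv)
        return x + out                               # residual keeps features near the pre-trained space

layer = FusedCrossAttention(nn.MultiheadAttention(256, 8, batch_first=True))
print(layer(torch.randn(2, 128, 256)).shape)         # torch.Size([2, 128, 256])
```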
Visual control policies can encounter significant performance degradation when visual conditions like lighting or camera position differ from those seen during training -- often exhibiting sharp declines in capability even for minor differences. In this work, we examine robustness to a suite of these types of visual changes for RGB-D and point cloud based visual control policies. To perform these experiments on both model-free and model-based reinforcement learners, we introduce a novel Point Cloud World Model (PCWM) and point cloud based control policies. Our experiments show that policies that explicitly encode point clouds are significantly more robust than their RGB-D counterparts. Further, we find our proposed PCWM significantly outperforms prior works in terms of sample efficiency during training. Taken together, these results suggest reasoning about the 3D scene through point clouds can improve performance, reduce learning time, and increase robustness for robotic learners. Project Webpage: this https URL
https://arxiv.org/abs/2404.18926
Mamba, a recent selective structured state space model, performs excellently on long sequence modeling tasks. Mamba mitigates the modeling constraints of convolutional neural networks and offers advanced modeling capabilities similar to those of Transformers, through global receptive fields and dynamic weighting. Crucially, it achieves this without incurring the quadratic computational complexity typically associated with Transformers. Due to its advantages over the former two mainstream foundation models, Mamba exhibits great potential to be a visual foundation model. Researchers are actively applying Mamba to various computer vision tasks, leading to numerous emerging works. To help keep pace with the rapid advancements in computer vision, this paper aims to provide a comprehensive review of visual Mamba approaches. This paper begins by delineating the formulation of the original Mamba model. Subsequently, our review of visual Mamba delves into several representative backbone networks to elucidate the core insights of the visual Mamba. We then categorize related works using different modalities, including image, video, point cloud, multi-modal, and others. Specifically, for image applications, we further organize them into distinct tasks to facilitate a more structured discussion. Finally, we discuss the challenges and future research directions for visual Mamba, providing insights for future research in this quickly evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at this https URL.
https://arxiv.org/abs/2404.18861
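Since the survey above begins from the formulation of the original Mamba model, a minimal sketch of the selective state-space recurrence it builds on may help; per-timestep parameters make the scan input-dependent. This is a simplified sequential reference (real implementations use a parallel scan and hardware-aware kernels, and exact ZOH discretization for A), with illustrative shapes.

```python
# Minimal sketch of the selective state-space recurrence at the core of Mamba.
import numpy as np

def selective_scan(x, A, B, C, delta):
    """x: (T, D) input; A: (D, N) state matrix (diagonal in practice);
    B, C: (T, N) input/output projections; delta: (T, D) step sizes."""
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.zeros((T, D))
    for t in range(T):
        A_bar = np.exp(delta[t][:, None] * A)        # discretized state transition
        B_bar = delta[t][:, None] * B[t][None, :]    # simplified (Euler) input discretization
        h = A_bar * h + B_bar * x[t][:, None]        # recurrent state update
        y[t] = (h * C[t][None, :]).sum(axis=1)       # readout
    return y

T, D, N = 16, 8, 4
out = selective_scan(np.random.randn(T, D), -np.abs(np.random.randn(D, N)),
                     np.random.randn(T, N), np.random.randn(T, N),
                     np.abs(np.random.randn(T, D)) * 0.1)
print(out.shape)  # (16, 8)
```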
Although point cloud models have achieved significant improvements in prediction accuracy in recent years, their trustworthiness is still not sufficiently investigated. In terms of global explainability, Activation Maximization (AM) techniques from the image domain are not directly transplantable due to the special structure of point cloud models. Existing studies exploit generative models to yield global explanations that can be perceived by humans. However, the opacity of the generative models themselves and the introduction of additional priors call into question the plausibility and fidelity of the explanations. In this work, we demonstrate that when the classifier predicts different types of instances, the intermediate-layer activations are activated differently, a property referred to as activation flows. Based on this property, we propose an activation flow-based AM method that generates perceivable global explanations without incorporating any generative model. Furthermore, we reveal that AM based on generative models fails the sanity checks and thus lacks fidelity. Extensive experiments show that our approach dramatically enhances the perceptibility of explanations compared to other AM methods that are not based on generative models. Our code is available at: this https URL
https://arxiv.org/abs/2404.18760
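To make the generative-model-free idea above concrete, here is a hedged sketch of activation maximization applied directly to a point cloud input: the points themselves are optimized to maximize a class logit, with a hook on an intermediate layer standing in for the paper's activation-flow objective (which the abstract does not specify). The stand-in classifier, loss weights, and hyperparameters are all illustrative assumptions.

```python
# Hedged sketch: AM on the raw point cloud, no generative model involved.
import torch
import torch.nn as nn

class TinyPointClassifier(nn.Module):
    """Stand-in classifier so the sketch runs; any point-cloud model works."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
        self.head = nn.Linear(128, num_classes)

    def forward(self, pts):                          # pts: (B, N, 3)
        feats = self.point_mlp(pts)                  # per-point features
        return self.head(feats.max(dim=1).values)    # global max-pool, then classify

def activation_maximization(model, target_class, num_points=512, steps=100, lr=0.01):
    pts = torch.randn(1, num_points, 3, requires_grad=True)
    captured = {}
    hook = model.point_mlp.register_forward_hook(lambda m, i, o: captured.update(feat=o))
    opt = torch.optim.Adam([pts], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(pts)
        # Maximize the class logit; the small intermediate-activation term is a placeholder
        # for the paper's activation-flow objective (not given in the abstract).
        loss = -logits[0, target_class] - 0.01 * captured['feat'].mean()
        loss.backward()
        opt.step()
    hook.remove()
    return pts.detach()

explanation = activation_maximization(TinyPointClassifier(), target_class=3)
print(explanation.shape)  # torch.Size([1, 512, 3])
```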
This study investigates the application of PointNet and PointNet++ in the classification of LiDAR-generated point cloud data, a critical component for achieving fully autonomous vehicles. Utilizing a modified dataset from the Lyft 3D Object Detection Challenge, we examine the models' capabilities to handle dynamic and complex environments essential for autonomous navigation. Our analysis shows that PointNet and PointNet++ achieved accuracy rates of 79.53% and 84.24%, respectively. These results underscore the models' robustness in interpreting intricate environmental data, which is pivotal for the safety and efficiency of autonomous vehicles. Moreover, the enhanced detection accuracy, particularly in distinguishing pedestrians from other objects, highlights the potential of these models to contribute substantially to the advancement of autonomous vehicle technology.
https://arxiv.org/abs/2404.18665
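For readers unfamiliar with the architecture the study evaluates, the following is a minimal PointNet-style classifier: a shared per-point MLP followed by a symmetric max-pool and a classification head. The T-Net alignment branches are omitted for brevity, and the layer sizes and class count are illustrative rather than the study's exact configuration.

```python
# Minimal PointNet-style classifier for LiDAR point clouds (T-Nets omitted).
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    def __init__(self, num_classes=9):
        super().__init__()
        # Shared per-point MLP implemented with 1x1 convolutions over (B, C, N).
        self.features = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, pts):                       # pts: (B, N, 3)
        x = self.features(pts.transpose(1, 2))    # (B, 1024, N)
        global_feat = x.max(dim=2).values         # symmetric max-pool over points
        return self.classifier(global_feat)

logits = MiniPointNet()(torch.randn(4, 2048, 3))
print(logits.shape)  # torch.Size([4, 9])
```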
Collective Perception has attracted significant attention in recent years due to its advantage for mitigating occlusion and expanding the field-of-view, thereby enhancing reliability, efficiency, and, most crucially, decision-making safety. However, developing collective perception models is highly resource demanding due to extensive requirements of processing input data for many agents, usually dozens of images and point clouds for a single frame. This not only slows down the model development process for collective perception but also impedes the utilization of larger models. In this paper, we propose an agent-based training framework that handles the deep learning modules and agent data separately to have a cleaner data flow structure. This framework not only provides an API for flexibly prototyping the data processing pipeline and defining the gradient calculation for each agent, but also provides the user interface for interactive training, testing and data visualization. Training experiment results of four collective object detection models on the prominent collective perception benchmark OPV2V show that the agent-based training can significantly reduce the GPU memory consumption and training time while retaining inference performance. The framework and model implementations are available at \url{this https URL}
https://arxiv.org/abs/2404.18617
This paper proposes a framework for the 3D reconstruction of satellites in low-Earth orbit, utilizing videos captured by small amateur telescopes. The video data obtained from these telescopes differ significantly from data for standard 3D reconstruction tasks, characterized by intense motion blur, atmospheric turbulence, pervasive background light pollution, extended focal length and constrained observational perspectives. To address these challenges, our approach begins with a comprehensive pre-processing workflow that encompasses deep learning-based image restoration, feature point extraction and camera pose initialization. We proceed with the application of an improved 3D Gaussian splatting algorithm for reconstructing the 3D model. Our technique supports simultaneous 3D Gaussian training and pose estimation, enabling the robust generation of intricate 3D point clouds from sparse, noisy data. The procedure is further bolstered by a post-editing phase designed to eliminate noise points inconsistent with our prior knowledge of a satellite's geometric constraints. We validate our approach using both synthetic datasets and actual observations of China's Space Station, showcasing its significant advantages over existing methods in reconstructing 3D space objects from ground-based observations.
https://arxiv.org/abs/2404.18394
Although large multi-modality models (LMMs) have seen extensive exploration and application in various quality assessment studies, their integration into Point Cloud Quality Assessment (PCQA) remains unexplored. Given LMMs' exceptional performance and robustness in low-level vision and quality assessment tasks, this study aims to investigate the feasibility of imparting PCQA knowledge to LMMs through text supervision. To achieve this, we transform quality labels into textual descriptions during the fine-tuning phase, enabling LMMs to derive quality rating logits from 2D projections of point clouds. To compensate for the loss of perception in the 3D domain, structural features are extracted as well. These quality logits and structural features are then combined and regressed into quality scores. Our experimental results affirm the effectiveness of our approach, showcasing a novel integration of LMMs into PCQA that enhances model understanding and assessment accuracy. We hope our contributions can inspire subsequent investigations into the fusion of LMMs with PCQA, fostering advancements in 3D visual quality analysis and beyond.
https://arxiv.org/abs/2404.18203
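The abstract describes combining the LMM's quality-rating logits with structural features and regressing them into scores, without specifying the fusion. The sketch below is one plausible reading under stated assumptions: logits over a five-word rating vocabulary are converted to an expected rating, concatenated with structural features, and passed through a small MLP. The vocabulary, dimensions, and regressor are illustrative.

```python
# Hedged sketch of fusing LMM quality logits with structural features for PCQA.
import torch
import torch.nn as nn

class QualityRegressor(nn.Module):
    RATING_VALUES = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])    # excellent ... bad (assumed)

    def __init__(self, struct_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1 + struct_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, quality_logits, struct_feats):
        """quality_logits: (B, 5) LMM logits over five rating words for the 2D projections;
        struct_feats: (B, struct_dim) structural features from the raw 3D points."""
        probs = torch.softmax(quality_logits, dim=-1)
        expected = (probs * self.RATING_VALUES.to(probs.device)).sum(dim=-1, keepdim=True)
        return self.mlp(torch.cat([expected, struct_feats], dim=-1)).squeeze(-1)

scores = QualityRegressor()(torch.randn(4, 5), torch.randn(4, 16))
print(scores.shape)  # torch.Size([4])
```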
In this work, we propose a novel discriminative framework for dexterous grasp generation, named Dexterous Grasp TRansformer (DGTR), capable of predicting a diverse set of feasible grasp poses by processing the object point cloud with only one forward pass. We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model for it. However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping and results in restricted performance. To address these issues, we propose progressive strategies for both the training and testing phases. First, the dynamic-static matching training (DSMT) strategy is presented to enhance the optimization stability during the training phase. Second, we introduce the adversarial-balanced test-time adaptation (AB-TTA) with a pair of adversarial losses to improve grasping quality during the testing phase. Experimental results on the DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous grasp poses with both high quality and diversity. Notably, while keeping high quality, the diversity of grasp poses predicted by DGTR significantly outperforms previous works in multiple metrics without any data pre-processing. Codes are available at this https URL .
https://arxiv.org/abs/2404.18135
In this article, a novel approach for merging 3D point cloud maps in the context of egocentric multi-robot exploration is presented. Unlike traditional methods, the proposed approach leverages state-of-the-art place recognition and learned descriptors to efficiently detect overlap between maps, eliminating the need for the time-consuming global feature extraction and feature matching process. The estimated overlapping regions are used to calculate a homogeneous rigid transform, which serves as an initial condition for the GICP point cloud registration algorithm to refine the alignment between the maps. The advantages of this approach include faster processing time, improved accuracy, and increased robustness in challenging environments. Furthermore, the effectiveness of the proposed framework is successfully demonstrated through multiple field missions of robot exploration in a variety of different underground environments.
https://arxiv.org/abs/2404.18006
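A minimal Open3D sketch of the merging step described above: an initial rigid transform, assumed here to come from the place-recognition overlap estimate, seeds a local ICP refinement between the two maps. The paper refines with GICP; point-to-plane ICP is used below as a widely available stand-in, and the downsampling and correspondence parameters are illustrative.

```python
# Hedged sketch: refine an initial map-to-map transform with ICP, then fuse the maps.
import copy
import numpy as np
import open3d as o3d

def merge_maps(map_a, map_b, init_transform, voxel=0.25, max_corr_dist=1.0):
    """map_a / map_b: o3d.geometry.PointCloud maps in each robot's frame;
    init_transform: 4x4 initial guess (assumed to come from the overlap estimate)."""
    a = map_a.voxel_down_sample(voxel)
    b = map_b.voxel_down_sample(voxel)
    a.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        b, a, max_corr_dist, init_transform,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    aligned_b = copy.deepcopy(map_b)
    aligned_b.transform(result.transformation)       # move map_b into map_a's frame
    return (map_a + aligned_b).voxel_down_sample(voxel), result.transformation

# Toy usage: merge a random cloud with itself under an identity initial guess.
cloud = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(np.random.rand(2000, 3) * 10.0))
merged, T = merge_maps(cloud, cloud, np.eye(4))
print(T.shape, len(merged.points))
```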
Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn't be able to identify.
https://arxiv.org/abs/2404.17922
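A hedged sketch of the query side of such an instance-level map: each mapped instance carries a language-aligned embedding (assumed precomputed by a CLIP-style encoder from the foundation models mentioned above), and a natural-language command is answered by cosine similarity against those embeddings. The embeddings below are random stand-ins and the function is illustrative, not the paper's pipeline.

```python
# Hedged sketch: open-vocabulary lookup over an instance-level 3D map.
import numpy as np

def query_instance_map(instance_embeddings, instance_centroids, text_embedding, top_k=1):
    """instance_embeddings: (M, D) one embedding per mapped instance;
    instance_centroids: (M, 3) instance positions in the map;
    text_embedding: (D,) embedding of the language query."""
    e = instance_embeddings / np.linalg.norm(instance_embeddings, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    scores = e @ t                                           # cosine similarity per instance
    best = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i]), instance_centroids[i]) for i in best]

rng = np.random.default_rng(0)
hits = query_instance_map(rng.normal(size=(50, 512)), rng.uniform(-5, 5, (50, 3)),
                          rng.normal(size=512))
print(hits[0][0], round(hits[0][1], 3))
```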
Semi-supervised 3D object detection can benefit from the promising pseudo-labeling technique when labeled data is limited. However, recent approaches have overlooked the impact of noisy pseudo-labels during training, despite efforts to enhance pseudo-label quality through confidence-based filtering. In this paper, we examine the impact of noisy pseudo-labels on IoU-based target assignment and propose the Reliable Student framework, which incorporates two complementary approaches to mitigate errors. First, it involves a class-aware target assignment strategy that reduces false negative assignments in difficult classes. Second, it includes a reliability weighting strategy that suppresses false positive assignment errors while also addressing remaining false negatives from the first step. The reliability weights are determined by querying the teacher network for confidence scores of the student-generated proposals. Our work surpasses the previous state-of-the-art on KITTI 3D object detection benchmark on point clouds in the semi-supervised setting. On 1% labeled data, our approach achieves a 6.2% AP improvement for the pedestrian class, despite having only 37 labeled samples available. The improvements become significant for the 2% setting, achieving 6.0% AP and 5.7% AP improvements for the pedestrian and cyclist classes, respectively.
https://arxiv.org/abs/2404.17910
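To make the reliability-weighting idea above concrete, here is a hedged sketch: the teacher's confidence on the student's proposals down-weights likely false-positive assignments in the student's loss. The abstract does not give the exact weighting, so the per-proposal weighted cross-entropy below is a plausible illustration, with assumed tensor shapes.

```python
# Hedged sketch of a teacher-confidence-weighted pseudo-label loss.
import torch
import torch.nn.functional as F

def reliability_weighted_loss(student_logits, assigned_labels, teacher_confidence):
    """student_logits: (P, C) class logits for P student proposals;
    assigned_labels: (P,) pseudo-labels from IoU-based target assignment;
    teacher_confidence: (P,) teacher scores for the same proposals, in [0, 1]."""
    per_proposal = F.cross_entropy(student_logits, assigned_labels, reduction='none')
    weights = teacher_confidence.detach()            # no gradient through the teacher
    return (weights * per_proposal).sum() / weights.sum().clamp(min=1e-6)

loss = reliability_weighted_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)), torch.rand(8))
print(loss.item())
```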
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration. It seeks to localize a position from a city-scale point cloud scene based on a few natural language instructions. In this paper, we address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances. Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation. In both stages, we introduce an instance query extractor, in which the cells are encoded by a 3D sparse convolution U-Net to generate the multi-scale point cloud features, and a set of queries iteratively attend to these features to represent instances. In the coarse stage, a row-column relative position-aware self-attention (RowColRPA) module is designed to capture the spatial relations among the instance queries. In the fine stage, a multi-modal relative position-aware cross-attention (RPCA) module is developed to fuse the text and point cloud features along with spatial relations for improving fine position estimation. Experiment results on the KITTI360Pose dataset demonstrate that our model achieves competitive performance with the state-of-the-art models without taking ground-truth instances as input.
https://arxiv.org/abs/2404.17845
Complementary to prevalent LiDAR and camera systems, millimeter-wave (mmWave) radar is robust to adverse weather conditions like fog, rainstorms, and blizzards but offers only sparse point clouds. Current techniques enhance the point cloud under the supervision of LiDAR data. However, high-performance LiDAR is notably expensive and is not commonly available on vehicles. This paper presents mmEMP, a supervised learning approach that enhances radar point clouds using a low-cost camera and an inertial measurement unit (IMU), enabling the crowdsourcing of training data from commercial vehicles. Bringing in visual-inertial (VI) supervision is challenging due to the spatially agnostic nature of dynamic objects. Moreover, spurious radar points arising from the curse of RF multipath make robots misunderstand the scene. mmEMP first devises a dynamic 3D reconstruction algorithm that restores the 3D positions of dynamic features. Then, we design a neural network that densifies radar data and eliminates spurious radar points. We build a new dataset in the real world. Extensive experiments show that mmEMP achieves competitive performance compared with the SOTA approach trained on LiDAR data. In addition, we use the enhanced point cloud to perform object detection, localization, and mapping to demonstrate mmEMP's effectiveness.
https://arxiv.org/abs/2404.17229
Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating 3D zero-shot capability on unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D-to-3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving 2D VLM knowledge adaptation for 3D learning with efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing benchmarks in zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating the generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
https://arxiv.org/abs/2404.16538
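A hedged sketch of the depth-map side of this pipeline: project a point cloud into a simple pinhole depth map (the starting point for OpenDlign's depth-aligned images), then score class prompts by cosine similarity between an image embedding and text embeddings. The actual VLM encoders and the depth-aligned image generation are assumed and replaced here with random placeholder embeddings; intrinsics and resolution are illustrative.

```python
# Hedged sketch: point cloud -> depth map, plus CLIP-style zero-shot scoring with placeholders.
import numpy as np

def project_depth_map(points, fx=200.0, fy=200.0, cx=112.0, cy=112.0, size=224):
    """points: (N, 3) in a camera frame with +z forward; returns a (size, size) depth map."""
    depth = np.zeros((size, size), dtype=np.float32)
    pts = points[points[:, 2] > 1e-3]
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    ok = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    for x, y, z in zip(u[ok], v[ok], pts[ok, 2]):
        depth[y, x] = z if depth[y, x] == 0 else min(depth[y, x], z)  # keep nearest point
    return depth

def zero_shot_scores(image_embedding, text_embeddings):
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return txt @ img                                  # cosine similarity per class prompt

rng = np.random.default_rng(1)
cloud = rng.normal(size=(4096, 3)) * 0.3 + np.array([0.0, 0.0, 2.0])
depth = project_depth_map(cloud)
print(depth.shape, zero_shot_scores(rng.normal(size=512), rng.normal(size=(10, 512))).shape)
```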