This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e., supervised or self-supervised. The challenge received a total of 19 submissions that outperformed the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
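For reference, 3D F-Score-style metrics of the kind reported above are usually computed from point-cloud precision and recall at a distance threshold. Below is a minimal sketch of such a metric; the threshold value and the nearest-neighbour search are assumptions, not the challenge's exact evaluation protocol.

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, tau=0.05):
    """Point-cloud F-score at distance threshold tau (illustrative only)."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # precision side
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # recall side
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```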
https://arxiv.org/abs/2404.16831
Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphics software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as references for generating new SVBRDF materials from the original diffuse maps, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.
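As a rough illustration of the matching step in 2), the sketch below assigns an object part to the closest entry of a tiny material library by token overlap between textual descriptions. The library, the part description, and the scoring are simplified placeholders, not the paper's GPT-4V-based pipeline.

```python
# Hypothetical material library: name -> short textual description.
MATERIAL_LIBRARY = {
    "brushed_steel": "gray metallic brushed reflective metal",
    "oak_wood": "brown wooden grainy matte oak",
    "red_leather": "red leather soft matte grain",
}

def match_material(part_description: str) -> str:
    """Pick the library material whose description shares the most tokens."""
    part_tokens = set(part_description.lower().split())
    def overlap(item):
        _, desc = item
        return len(part_tokens & set(desc.split()))
    return max(MATERIAL_LIBRARY.items(), key=overlap)[0]

# e.g. a part description produced upstream by an MLLM (assumed):
print(match_material("a matte brown wooden chair leg"))  # -> "oak_wood"
```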
https://arxiv.org/abs/2404.16829
Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which emphasizes the need for open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (over 25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by training them to generate texts based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
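To make the structure of the grounded annotations concrete, here is a minimal, hypothetical record layout for report sentences and VQA pairs linked to region masks; the field names are illustrative and the released format may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GroundedSentence:
    """One report sentence tied to an anatomical region of the CT volume."""
    sentence: str        # e.g. "There is a nodule in the right upper lobe."
    region_label: str    # one of ~197 organ-level categories (assumed naming)
    mask_path: str       # path to the binary segmentation mask volume

@dataclass
class GroundedReport:
    volume_path: str     # path to the non-contrast 3D chest CT volume
    sentences: List[GroundedSentence] = field(default_factory=list)

@dataclass
class GroundedVQA:
    volume_path: str
    question: str
    answer: str
    mask_path: str       # reference mask grounding the answer
```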
https://arxiv.org/abs/2404.16754
We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at this https URL.
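The exact form of TALS is not given in the abstract; the snippet below is only a hedged sketch of the underlying idea, leaving residuals below an assumed threshold unpenalized while keeping a full penalty on gross errors.

```python
import torch

def tals_like_loss(residual: torch.Tensor, threshold: float) -> torch.Tensor:
    """Hinge-style sketch: residuals below `threshold` contribute nothing,
    gross residuals are penalized linearly beyond it (illustrative only)."""
    return torch.relu(residual.abs() - threshold).mean()

# Example: 2D keypoint reprojection residuals in pixels (threshold is an assumption).
residuals = torch.tensor([0.5, 2.0, 15.0, 40.0])
print(tals_like_loss(residuals, threshold=5.0))
```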
https://arxiv.org/abs/2404.16752
This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes clothing editing difficult and forfeits fine-grained control over the whole generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control over the generation process. The basic idea is to progressively generate a minimal-clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, thereby providing an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: this http URL
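The stratified compositional rendering itself is not detailed in the abstract. As a rough sketch, per-layer RGBA renders could be fused front-to-back with standard alpha compositing, as below; the layer ordering and the compositing rule are assumptions.

```python
import numpy as np

def composite_layers(layers):
    """Front-to-back 'over' compositing of per-layer RGBA images.
    `layers` is a list of (H, W, 4) float arrays ordered outermost first."""
    h, w, _ = layers[0].shape
    rgb = np.zeros((h, w, 3))
    transmittance = np.ones((h, w, 1))
    for layer in layers:                      # e.g. jacket, shirt, body
        color, alpha = layer[..., :3], layer[..., 3:4]
        rgb += transmittance * alpha * color
        transmittance *= (1.0 - alpha)
    return rgb
```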
https://arxiv.org/abs/2404.16748
We propose a novel multi-stage trans-dimensional architecture for multi-view cardiac image segmentation. Our method exploits the relationship between long-axis (2D) and short-axis (3D) magnetic resonance (MR) images to perform a sequential 3D-to-2D-to-3D segmentation of the short-axis and long-axis images. In the first stage, 3D segmentation is performed using the short-axis image, and the prediction is transformed to the long-axis view and used as a segmentation prior in the next stage. In the second stage, the heart region is localized and cropped around the segmentation prior using a Heart Localization and Cropping (HLC) module, focusing the subsequent model on the heart region of the image, where a 2D segmentation is performed. Similarly, we transform the long-axis prediction to the short-axis view, localize and crop the heart region, and again perform a 3D segmentation to refine the initial short-axis segmentation. We evaluate our proposed method on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) dataset, where our method outperforms state-of-the-art methods in segmenting cardiac regions of interest in both short-axis and long-axis images. The pre-trained models, source code, and implementation details will be publicly available.
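A schematic of the staged pipeline might look like the following; the segmentation networks, view transformations, and the HLC cropping function are hypothetical placeholders standing in for the modules described above.

```python
def trans_dimensional_pipeline(short_axis_vol, long_axis_img,
                               seg3d_a, seg2d, seg3d_b,
                               to_long_axis, to_short_axis, crop_around):
    """Sketch of the 3D-to-2D-to-3D segmentation flow (all callables assumed)."""
    # Stage 1: 3D segmentation on the short-axis volume.
    sa_pred = seg3d_a(short_axis_vol)
    # Transform the prediction to the long-axis view and use it as a prior.
    la_prior = to_long_axis(sa_pred)
    # Stage 2: Heart Localization and Cropping (HLC), then 2D segmentation.
    la_crop = crop_around(long_axis_img, la_prior)
    la_pred = seg2d(la_crop, la_prior)
    # Stage 3: back to the short-axis view to refine the initial 3D segmentation.
    sa_prior = to_short_axis(la_pred)
    sa_crop = crop_around(short_axis_vol, sa_prior)
    return seg3d_b(sa_crop, sa_prior), la_pred
```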
https://arxiv.org/abs/2404.16708
While neural implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, thereby limiting their applications in physics-demanding domains like embodied AI and robotics. The lack of plausibility originates from both the absence of physics modeling in the existing pipeline and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, which stands as the first approach to harness both differentiable rendering and differentiable physics simulation to learn implicit surface representations. Our framework proposes a novel differentiable particle-based physical simulator seamlessly integrated with the neural implicit representation. At its core is an efficient transformation between SDF-based implicit representation and explicit surface points by our proposed algorithm, Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Moreover, we model both rendering and physical uncertainty to identify and compensate for the inconsistent and inaccurate monocular geometric priors. The physical uncertainty additionally enables a physics-guided pixel sampling to enhance the learning of slender structures. By amalgamating these techniques, our model facilitates efficient joint modeling of appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods in terms of reconstruction quality. Our reconstruction results also yield superior physical stability, verified by Isaac Gym, with at least a 40% improvement across all datasets, opening broader avenues for future physics-based applications.
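The following is only a minimal illustration of the SDF-to-explicit-surface conversion that SP-MC builds on, using plain (non-differentiable) Marching Cubes from scikit-image rather than the paper's differentiable variant.

```python
import numpy as np
from skimage import measure

def sdf_to_surface_points(sdf_grid, voxel_size=1.0):
    """Extract explicit surface points from a dense SDF grid at the zero level set.
    This is standard Marching Cubes, not the paper's differentiable SP-MC."""
    verts, faces, normals, _ = measure.marching_cubes(sdf_grid, level=0.0)
    return verts * voxel_size, faces, normals

# Toy SDF of a sphere of radius 10 on a 32^3 grid.
coords = np.stack(np.meshgrid(*[np.arange(32)] * 3, indexing="ij"), axis=-1)
sdf = np.linalg.norm(coords - 16.0, axis=-1) - 10.0
points, _, _ = sdf_to_surface_points(sdf)
print(points.shape)
```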
https://arxiv.org/abs/2404.16666
We revisit certain problems of pose estimation based on 3D--2D correspondences between features which may be points or lines. Specifically, we address the two previously-studied minimal problems of estimating camera extrinsics from $p \in \{ 1, 2 \}$ point--point correspondences and $l=3-p$ line--line correspondences. To the best of our knowledge, all of the previously-known practical solutions to these problems required computing the roots of degree $\ge 4$ (univariate) polynomials when $p=2$, or degree $\ge 8$ polynomials when $p=1.$ We describe and implement two elementary solutions which reduce the degrees of the needed polynomials from $4$ to $2$ and from $8$ to $4$, respectively. We show experimentally that the resulting solvers are numerically stable and fast: when compared to the previous state-of-the art, we may obtain nearly an order of magnitude speedup. The code is available at \url{this https URL\_absolute}
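To make the degree-reduction claim concrete: a quadratic admits a closed-form solution, whereas a quartic is typically handled by a numerical (companion-matrix) root finder. The polynomials below are arbitrary examples, not ones arising from the actual solvers.

```python
import numpy as np

# Degree-2: closed form (what the p = 2 solver now reduces to).
a, b, c = 1.0, -3.0, 2.0
disc = b * b - 4 * a * c
quadratic_roots = [(-b + s * np.sqrt(disc)) / (2 * a) for s in (+1, -1)]

# Degree-4: generic numerical root finding (what was previously required).
quartic_roots = np.roots([1.0, 0.0, -5.0, 0.0, 4.0])  # x^4 - 5x^2 + 4

print(quadratic_roots)   # [2.0, 1.0]
print(quartic_roots)     # roots of the quartic: +/-1, +/-2
```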
https://arxiv.org/abs/2404.16552
In this paper, we propose a novel approach to address the problem of camera and radar sensor fusion for 3D object detection in autonomous vehicle perception systems. Our approach builds on recent advances in deep learning and leverages the strengths of both sensors to improve object detection performance. Specifically, we extract 2D features from camera images using a state-of-the-art deep learning architecture and then apply a novel Cross-Domain Spatial Matching (CDSM) transformation method to convert these features into 3D space. We then fuse them with extracted radar data using a complementary fusion strategy to produce a final 3D object representation. To demonstrate the effectiveness of our approach, we evaluate it on the nuScenes dataset. We compare our approach to both single-sensor performance and current state-of-the-art fusion methods. Our results show that the proposed approach achieves superior performance over single-sensor solutions and can compete directly with other top-level fusion methods.
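The abstract does not spell out the fusion operator. Below is a generic late-fusion sketch that concatenates camera-derived and radar-derived BEV feature maps and mixes them with a convolution, assuming the camera features have already been lifted to 3D by a CDSM-style transformation; it is a placeholder, not the paper's actual design.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Toy complementary fusion of camera and radar BEV feature maps."""
    def __init__(self, cam_ch=64, radar_ch=16, out_ch=64):
        super().__init__()
        self.mix = nn.Conv2d(cam_ch + radar_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, cam_bev, radar_bev):
        # Both inputs are assumed to already live on the same BEV grid.
        return self.mix(torch.cat([cam_bev, radar_bev], dim=1))

fused = SimpleFusion()(torch.randn(1, 64, 128, 128), torch.randn(1, 16, 128, 128))
print(fused.shape)  # torch.Size([1, 64, 128, 128])
```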
https://arxiv.org/abs/2404.16548
Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating 3D zero-shot capability in unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D to 3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes the depth map projection and integrates depth-specific text prompts, improving the adaptation of 2D VLM knowledge to 3D learning with efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing benchmarks in zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
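Depth maps projected from point clouds are the starting point for the depth-aligned images. The sketch below shows a bare-bones pinhole projection with nearest-depth z-buffering; the intrinsics and image size are assumptions, and the actual multi-view projection in OpenDlign is likely more involved.

```python
import numpy as np

def project_depth_map(points, f=256.0, size=(256, 256)):
    """Project a point cloud (N, 3) in camera coordinates to a depth map
    with a pinhole model and nearest-depth z-buffering (illustrative only)."""
    h, w = size
    depth = np.full((h, w), np.inf)
    z = points[:, 2]
    valid = z > 1e-6
    u = np.round(f * points[valid, 0] / z[valid] + w / 2).astype(int)
    v = np.round(f * points[valid, 1] / z[valid] + h / 2).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for uu, vv, zz in zip(u[inside], v[inside], z[valid][inside]):
        depth[vv, uu] = min(depth[vv, uu], zz)
    depth[np.isinf(depth)] = 0.0
    return depth
```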
https://arxiv.org/abs/2404.16538
Generative 3D face models featuring disentangled controlling factors hold immense potential for diverse applications in computer vision and computer graphics. However, previous 3D face modeling methods face a challenge as they demand specific labels to effectively disentangle these factors. This becomes particularly problematic when integrating multiple 3D face datasets to improve the generalization of the model. Addressing this issue, this paper introduces a Weakly-Supervised Disentanglement Framework, denoted as WSDF, to facilitate the training of controllable 3D face models without an overly stringent labeling requirement. Adhering to the paradigm of Variational Autoencoders (VAEs), the proposed model achieves disentanglement of identity and expression controlling factors through a two-branch encoder equipped with a dedicated identity-consistency prior. It then faithfully re-entangles these factors via a tensor-based combination mechanism. Notably, the introduction of the Neutral Bank allows precise acquisition of subject-specific information using only identity labels, thereby averting degeneration due to insufficient supervision. Additionally, the framework incorporates a label-free second-order loss function for the expression factor to regulate deformation space and eliminate extraneous information, resulting in enhanced disentanglement. Extensive experiments have been conducted to substantiate the superior performance of WSDF. Our code is available at this https URL.
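The tensor-based combination mechanism is not specified beyond its name; a minimal sketch of a bilinear re-combination of identity and expression codes could look like the following, with all dimensions assumed.

```python
import torch

# Hypothetical dimensions: identity code, expression code, flattened mesh vertices.
id_dim, expr_dim, vert_dim = 64, 32, 5023 * 3
core = torch.randn(id_dim, expr_dim, vert_dim)   # combination tensor (toy init)

def combine(z_id, z_expr):
    """Bilinear (tensor-based) re-combination of identity and expression factors."""
    return torch.einsum('i,e,iev->v', z_id, z_expr, core)

verts = combine(torch.randn(id_dim), torch.randn(expr_dim)).reshape(5023, 3)
print(verts.shape)  # torch.Size([5023, 3])
```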
https://arxiv.org/abs/2404.16536
3D object generation has undergone significant advancements, yielding high-quality results. However, current methods fall short of achieving precise user control, often yielding results that do not align with user expectations, thus limiting their applicability. Realizing a user's envisioned 3D object remains challenging with current generative models due to their limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability, or (ii) reconstructing 3D objects from 2D images. Both limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process, restricting the scope for direct and versatile 3D modifications. In this work, we introduce Interactive3D, an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages, utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction, allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components, (ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) Semantic Editing. Subsequently, the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at \url{this https URL}.
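As a toy illustration of one of the interactions above, a rigid drag can be thought of as translating the centers of a user-selected subset of Gaussians; the selection rule and the fact that only centers are updated here are simplifications of the actual editing system.

```python
import numpy as np

def rigid_drag(means, selection_mask, translation):
    """Toy 'rigid dragging' edit: translate the centers of the selected Gaussians.
    `means` is (N, 3); the real system also updates covariances, colors, etc."""
    edited = means.copy()
    edited[selection_mask] += np.asarray(translation)
    return edited

means = np.random.rand(1000, 3)
mask = means[:, 2] > 0.8                 # e.g. a user-selected region (assumed)
means = rigid_drag(means, mask, [0.1, 0.0, 0.0])
```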
https://arxiv.org/abs/2404.16510
The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, the challenge arises due to the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs a Commonsense Prototype (CProto) characterized by a high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects with the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on the Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on the easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at this https URL.
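A hedged sketch of applying a size prior to a pseudo-label is shown below: the pseudo box keeps its center and heading but adopts the prototype's dimensions. The actual CPD refinement is likely more careful (e.g., preserving the observed surface), so this is illustrative only.

```python
import numpy as np

def refine_with_size_prior(pseudo_box, proto_size):
    """Toy refinement: keep the pseudo-label's center and heading, but replace
    its (l, w, h) with the class prototype's size prior.
    Box layout assumed as [x, y, z, l, w, h, yaw]."""
    refined = np.asarray(pseudo_box, dtype=float).copy()
    refined[3:6] = proto_size
    return refined

car_proto_size = [4.6, 1.9, 1.7]   # assumed prior in meters, not from the paper
box = refine_with_size_prior([10.0, -2.0, 0.5, 3.1, 1.4, 1.5, 0.2], car_proto_size)
print(box)
```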
https://arxiv.org/abs/2404.16493
In the context of imitation learning applied to dexterous robotic hands, the high complexity of the systems makes learning complex manipulation tasks challenging. However, the numerous datasets depicting human hands in a variety of tasks can provide better knowledge of human hand motion. We propose a method to leverage multiple large-scale, task-agnostic datasets to obtain latent representations that effectively encode motion subtrajectories, which we incorporate into a transformer-based behavior cloning method. Our results demonstrate that employing latent representations yields enhanced performance compared to conventional behavior cloning methods, particularly regarding resilience to errors and noise in perception and proprioception. Furthermore, the proposed approach solely relies on human demonstrations, eliminating the need for teleoperation and, therefore, accelerating the data acquisition process. Accurate inverse kinematics for fingertip retargeting ensures precise transfer from human hand data to the robot, facilitating effective learning and deployment of manipulation policies. Finally, the trained policies have been successfully transferred to a real-world 23-DoF robotic system.
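The snippet below only illustrates the notion of splitting a demonstration into fixed-length motion subtrajectories and mapping each to a latent vector; the window length and the stand-in encoder are placeholders, not the paper's learned representation.

```python
import numpy as np

def subtrajectories(traj, window=16, stride=8):
    """Split a (T, D) hand-pose trajectory into overlapping (window, D) chunks."""
    return np.stack([traj[t:t + window]
                     for t in range(0, len(traj) - window + 1, stride)])

def encode(chunks, encoder=None):
    """Map each subtrajectory to a latent vector; a pretrained encoder is assumed.
    A simple flatten stands in here so the sketch runs end to end."""
    if encoder is None:
        return chunks.reshape(len(chunks), -1)
    return encoder(chunks)

demo = np.random.rand(100, 45)          # e.g. 45-D hand pose over 100 timesteps
latents = encode(subtrajectories(demo))
print(latents.shape)                    # (11, 720)
```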
https://arxiv.org/abs/2404.16483
While originally developed for novel view synthesis, Neural Radiance Fields (NeRFs) have recently emerged as an alternative to multi-view stereo (MVS). Spurred by a multitude of research activities, promising results have been obtained, especially for texture-less, transparent, and reflective surfaces, while such scenarios remain challenging for traditional MVS-based approaches. However, most of these investigations focus on close-range scenarios, and studies for airborne scenarios are still missing. For this task, NeRFs face potential difficulties in areas of low image redundancy and weak data evidence, as often found in street canyons, facades, or building shadows. Furthermore, training such networks is computationally expensive. Thus, the aim of our work is twofold: First, we investigate the applicability of NeRFs for aerial image blocks representing different characteristics such as nadir-only, oblique, and high-resolution imagery. Second, during these investigations we demonstrate the benefit of integrating depth priors from tie-point measurements, which are provided by the presupposed Bundle Block Adjustment. Our work is based on the state-of-the-art framework VolSDF, which models 3D scenes by signed distance functions (SDFs), since this is better suited to surface reconstruction than the standard volumetric representation of vanilla NeRFs. For evaluation, the NeRF-based reconstructions are compared to results obtained on a publicly available benchmark dataset for airborne images.
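A sparse depth prior from tie points is typically added as an extra loss on the rendered depth. The sketch below shows a generic per-ray L1 term of that kind; the masking and weighting are assumptions rather than the paper's exact formulation.

```python
import torch

def depth_prior_loss(rendered_depth, tie_point_depth, valid_mask):
    """Generic sparse depth-prior term: L1 between the depth rendered along each
    ray and the depth of the corresponding tie point, only where a tie point
    projects into the pixel (all tensors are per-ray)."""
    diff = (rendered_depth - tie_point_depth).abs()
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)

loss = depth_prior_loss(torch.rand(1024), torch.rand(1024),
                        (torch.rand(1024) > 0.9).float())
```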
https://arxiv.org/abs/2404.16429
Image-guided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.
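Deducing a feasible assembly order from the learned object graph reduces, in its simplest form, to a topological sort over precedence edges. The tiny example below illustrates only that final step, with a hand-written graph rather than one predicted by Neural Assembler.

```python
from graphlib import TopologicalSorter

# Edges mean "must be placed before": base -> pillar -> roof, base -> stairs.
precedence = {
    "pillar": {"base"},
    "roof": {"pillar"},
    "stairs": {"base"},
}
order = list(TopologicalSorter(precedence).static_order())
print(order)  # e.g. ['base', 'pillar', 'stairs', 'roof']
```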
https://arxiv.org/abs/2404.16423
This paper presents a robust fine-tuning method designed for pre-trained 3D point cloud models to enhance feature robustness in downstream fine-tuned models. We highlight the limitations of current fine-tuning methods and the challenges of learning robust models. The proposed method, named Weight-Space Ensembles for Fine-Tuning then Linear Probing (WiSE-FT-LP), integrates the original pre-trained and fine-tuned models through weight-space ensembling followed by linear probing. This approach significantly enhances the performance of downstream fine-tuned models under distribution shifts, improving feature robustness while maintaining high performance on the target distribution. We apply this robust fine-tuning method to mainstream 3D point cloud pre-trained models and evaluate the quality of the model parameters and the degradation of downstream task performance. Experimental results demonstrate the effectiveness of WiSE-FT-LP in enhancing model robustness, effectively balancing downstream task performance and model feature robustness without altering the model structure.
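Weight-space ensembling in the WiSE-FT family is commonly a convex interpolation of pre-trained and fine-tuned parameters. The sketch below shows that interpolation step only; the mixing coefficient and the subsequent linear probing are not specified in the abstract.

```python
import torch

def interpolate_weights(pretrained_state, finetuned_state, alpha=0.5):
    """theta = (1 - alpha) * theta_pretrained + alpha * theta_finetuned."""
    return {k: (1 - alpha) * pretrained_state[k] + alpha * finetuned_state[k]
            for k in pretrained_state}

# Usage (both models assumed to share the same architecture):
# backbone.load_state_dict(interpolate_weights(pre.state_dict(), ft.state_dict(), 0.5))
```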
https://arxiv.org/abs/2404.16422
In this paper, we study the problem of 3D reconstruction from a single-view RGB image and propose a novel approach called DIG3D for 3D object reconstruction and novel view synthesis. Our method utilizes an encoder-decoder framework which generates 3D Gaussians in the decoder, guided by depth-aware image features from the encoder. In particular, we introduce the use of a deformable transformer, allowing efficient and effective decoding through 3D reference points and multi-layer refinement adaptations. By harnessing the benefits of 3D Gaussians, our approach offers an efficient and accurate solution for 3D reconstruction from single-view images. We evaluate our method on the ShapeNet SRN dataset, achieving PSNRs of 24.21 and 24.98 on the car and chair subsets, respectively. The result outperforms the recent method by around 2.25%, demonstrating the effectiveness of our method in achieving superior results.
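A decoder that emits 3D Gaussians needs a head mapping per-query features to Gaussian parameters. The toy head below is a generic sketch with assumed dimensions and activations, not DIG3D's actual deformable-transformer decoder.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Toy head mapping per-query features to 3D Gaussian parameters
    (center, scale, rotation quaternion, opacity, RGB); dimensions are assumed."""
    def __init__(self, dim=256):
        super().__init__()
        self.center = nn.Linear(dim, 3)
        self.scale = nn.Linear(dim, 3)
        self.rot = nn.Linear(dim, 4)
        self.opacity = nn.Linear(dim, 1)
        self.rgb = nn.Linear(dim, 3)

    def forward(self, q):
        return {
            "center": self.center(q),
            "scale": torch.exp(self.scale(q)),                         # positive scales
            "rotation": nn.functional.normalize(self.rot(q), dim=-1),  # unit quaternion
            "opacity": torch.sigmoid(self.opacity(q)),
            "rgb": torch.sigmoid(self.rgb(q)),
        }

out = GaussianHead()(torch.randn(1, 1024, 256))   # 1024 Gaussian queries
print(out["center"].shape)                        # torch.Size([1, 1024, 3])
```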
https://arxiv.org/abs/2404.16323
In this paper, we address the problem of enclosing an arbitrarily moving target in three dimensions by a single pursuer, which is an unmanned aerial vehicle (UAV), for maximum coverage while also ensuring the pursuer's safety by preventing collisions with the target. The proposed guidance strategy steers the pursuer to a safe region of space surrounding the target, allowing it to maintain a certain distance from the latter while offering greater flexibility in positioning and converging to any orbit within this safe zone. Our approach is distinguished by the use of nonholonomic constraints to model vehicles with accelerations serving as control inputs and coupled engagement kinematics to craft the pursuer's guidance law meticulously. Furthermore, we leverage the concept of the Lyapunov Barrier Function as a powerful tool to constrain the distance between the pursuer and the target within asymmetric bounds, thereby ensuring the pursuer's safety within the predefined region. To validate the efficacy and robustness of our algorithm, we conduct experimental tests by implementing a high-fidelity quadrotor model within Software-in-the-loop (SITL) simulations, encompassing various challenging target maneuver scenarios. The results obtained showcase the resilience of the proposed guidance law, effectively handling arbitrarily maneuvering targets, vehicle/autopilot dynamics, and external disturbances. Our method consistently delivers stable global enclosing behaviors, even in response to aggressive target maneuvers, and requires only relative information for successful execution.
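The abstract does not give the barrier function itself. A common asymmetric barrier-Lyapunov candidate for keeping the range error e = r - r_d within an asymmetric interval is shown below purely as an illustration of the construction, not the paper's exact law.

```latex
% Asymmetric barrier Lyapunov candidate (illustrative form only):
% V grows unbounded as the range error e = r - r_d approaches either bound,
% so keeping V bounded keeps the pursuer-target distance inside (-k_l, k_u).
V(e) =
\begin{cases}
  \dfrac{1}{2}\,\ln\!\dfrac{k_u^{2}}{k_u^{2}-e^{2}}, & 0 \le e < k_u, \\[2ex]
  \dfrac{1}{2}\,\ln\!\dfrac{k_l^{2}}{k_l^{2}-e^{2}}, & -k_l < e < 0.
\end{cases}
```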
https://arxiv.org/abs/2404.16312
Lane detection has made significant progress in recent years, but there is no unified architecture for its two sub-tasks: 2D lane detection and 3D lane detection. To fill this gap, we introduce BézierFormer, a unified 2D and 3D lane detection architecture based on a Bézier curve lane representation. BézierFormer formulates queries as Bézier control points and incorporates a novel Bézier curve attention mechanism. This attention mechanism enables comprehensive and accurate feature extraction for slender lane curves by sampling and fusing multiple reference points on each curve. In addition, we propose a novel Chamfer IoU-based loss which is more suitable for Bézier control-point regression. The state-of-the-art performance of BézierFormer on widely-used 2D and 3D lane detection benchmarks verifies its effectiveness and suggests it is worth further exploration.
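For context, a lane represented by n+1 Bézier control points can be sampled at arbitrary curve parameters with the Bernstein basis, which is also how reference points for the curve attention could be obtained. The control points in the snippet are arbitrary examples.

```python
import numpy as np
from math import comb

def bezier_points(control_points, num_samples=50):
    """Sample a Bézier curve defined by (n+1, 2) control points at `num_samples`
    evenly spaced parameters using the Bernstein basis."""
    P = np.asarray(control_points, dtype=float)
    n = len(P) - 1
    t = np.linspace(0.0, 1.0, num_samples)                     # (S,)
    basis = np.stack([comb(n, i) * t ** i * (1 - t) ** (n - i)
                      for i in range(n + 1)], axis=1)          # (S, n+1)
    return basis @ P                                           # (S, 2)

# A cubic lane in image coordinates (control points are arbitrary examples).
curve = bezier_points([[100, 700], [180, 500], [260, 300], [300, 100]])
print(curve.shape)  # (50, 2)
```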
https://arxiv.org/abs/2404.16304