In this work, we recover the underlying 3D structure of non-geometrically consistent scenes. We focus our analysis on hand-drawn images from cartoons and anime. Many cartoons are created by artists without a 3D rendering engine, which means that any new image of a scene is hand-drawn. Hand-drawn images are usually faithful representations of the world, but only in a qualitative sense, since it is difficult for humans to draw multiple perspectives of an object or scene in a 3D-consistent way. Nevertheless, people can easily perceive 3D scenes from inconsistent inputs! In this work, we correct for 2D drawing inconsistencies to recover a plausible 3D structure such that the newly warped drawings are consistent with each other. Our pipeline consists of a user-friendly annotation tool, camera pose estimation, and image deformation to recover a dense structure. Our method warps images to obey a perspective camera model, enabling our aligned results to be plugged into novel-view synthesis reconstruction methods to experience cartoons from viewpoints never drawn before. Our project page is https://toon3d.studio/.
https://arxiv.org/abs/2405.10320
Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at this https URL.
https://arxiv.org/abs/2405.10314
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
Brain lesion segmentation plays an essential role in neurological research and diagnosis. As brain lesions can be caused by various pathological alterations, different types of brain lesions tend to manifest with different characteristics on different imaging modalities. Due to this complexity, brain lesion segmentation methods are often developed in a task-specific manner. A specific segmentation model is developed for a particular lesion type and imaging modality. However, the use of task-specific models requires predetermination of the lesion type and imaging modality, which complicates their deployment in real-world scenarios. In this work, we propose a universal foundation model for 3D brain lesion segmentation, which can automatically segment different types of brain lesions for input data of various imaging modalities. We formulate a novel Mixture of Modality Experts (MoME) framework with multiple expert networks attending to different imaging modalities. A hierarchical gating network combines the expert predictions and fosters expertise collaboration. Furthermore, we introduce a curriculum learning strategy during training to avoid the degeneration of each expert network and preserve their specialization. We evaluated the proposed method on nine brain lesion datasets, encompassing five imaging modalities and eight lesion types. The results show that our model outperforms state-of-the-art universal models and provides promising generalization to unseen datasets.
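The key mechanism here is a gating network that fuses per-modality expert predictions. The paper's gate is hierarchical and its details are not given in the abstract, so the following is a minimal single-level sketch, assuming softmax gate weights over expert segmentation logits; all shapes and names are illustrative.

```python
import numpy as np

def gated_mixture(expert_logits: np.ndarray, gate_scores: np.ndarray) -> np.ndarray:
    """Combine expert predictions with softmax gate weights.

    expert_logits: (E, C, D, H, W) per-expert segmentation logits.
    gate_scores:   (E,) raw gate outputs for the input volume.
    Returns fused (C, D, H, W) logits.
    """
    w = np.exp(gate_scores - gate_scores.max())
    w /= w.sum()                                   # softmax over experts
    return np.tensordot(w, expert_logits, axes=1)  # gate-weighted sum

# Toy example: 3 modality experts, binary lesion segmentation on a 4^3 volume
experts = np.random.randn(3, 2, 4, 4, 4)
gates = np.array([2.0, 0.1, -1.0])  # e.g., the FLAIR expert is favored
fused = gated_mixture(experts, gates)
print(fused.shape)  # (2, 4, 4, 4)
```

A soft gate like this keeps every expert differentiable during training, which is what makes the curriculum strategy against expert degeneration possible.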
https://arxiv.org/abs/2405.10246
Active reconstruction techniques enable robots to autonomously collect scene data for full coverage, relieving users of the tedious and time-consuming data-capturing process. However, because they are designed around unsuitable scene representations, existing methods produce unrealistic reconstruction results or cannot evaluate quality online. Thanks to recent advances in explicit radiance field technology, online active high-fidelity reconstruction has become achievable. In this paper, we propose GS-Planner, a planning framework for active high-fidelity reconstruction using 3D Gaussian Splatting. By extending 3DGS to recognize unobserved regions, we evaluate the reconstruction quality and completeness of the 3DGS map online to guide the robot. We then design a sampling-based active reconstruction strategy to explore the unobserved areas and improve the geometric and textural quality of the reconstruction. To build a complete robotic active reconstruction system, we choose a quadrotor as the platform for its high agility. We then devise a 3DGS-based safety constraint to generate executable trajectories for quadrotor navigation in the 3DGS map. To validate the effectiveness of our method, we conduct extensive experiments and ablation studies in highly realistic simulation scenes.
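The sampling-based strategy amounts to scoring candidate viewpoints by how much currently unobserved area they would cover. A toy sketch of that idea follows, with a hypothetical range-based visibility test and an occupancy grid standing in for the paper's 3DGS completeness evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy map: True = unobserved cell (stand-in for 3DGS completeness checks)
unobserved = rng.random((20, 20, 5)) < 0.3
cells = np.argwhere(unobserved).astype(float)

def coverage_score(pose: np.ndarray, radius: float = 6.0) -> int:
    """Count unobserved cells within sensing range of a candidate pose.
    A real system would ray-cast against the 3DGS map instead."""
    d = np.linalg.norm(cells - pose, axis=1)
    return int((d < radius).sum())

# Sample candidate poses and greedily pick the most informative one
candidates = rng.uniform([0, 0, 0], [20, 20, 5], size=(50, 3))
best = max(candidates, key=coverage_score)
print("next viewpoint:", best)
```

In the full system such candidates would additionally be filtered by the 3DGS safety constraint before a trajectory is generated.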
https://arxiv.org/abs/2405.10142
Deformable image registration (alignment) is highly sought after in numerous clinical applications, such as computer-aided diagnosis and disease progression analysis. Deep Convolutional Neural Network (DCNN)-based image registration methods have demonstrated advantages in terms of registration accuracy and computational speed. However, while most methods excel at global alignment, they often perform worse in aligning local regions. To address this challenge, this paper proposes a mask-guided encoder-decoder DCNN-based image registration method, named MrRegNet. This approach employs a multi-resolution encoder for feature extraction and subsequently estimates multi-resolution displacement fields in the decoder to handle substantial image deformations. Furthermore, segmentation masks are employed to direct the model's attention toward aligning local regions. The results show that the proposed method outperforms traditional methods like Demons and a well-known deep learning method, VoxelMorph, on a public 3D brain MRI dataset (OASIS) and a local 2D brain MRI dataset with large deformations. Importantly, image alignment accuracy is significantly improved in the local regions guided by segmentation masks. GitHub link: this https URL.
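Mask guidance of this kind typically augments a global similarity term with an extra term restricted to the segmentation mask. A minimal sketch under that assumption (the exact MrRegNet loss may differ):

```python
import numpy as np

def mask_guided_loss(warped: np.ndarray, fixed: np.ndarray,
                     mask: np.ndarray, lam: float = 1.0) -> float:
    """Global MSE plus an extra MSE term focused on the masked region."""
    global_term = np.mean((warped - fixed) ** 2)
    local_term = np.sum(mask * (warped - fixed) ** 2) / max(mask.sum(), 1)
    return float(global_term + lam * local_term)

# Toy 2D example: a warped moving image compared to the fixed target
fixed = np.random.rand(64, 64)
warped = fixed + 0.05 * np.random.randn(64, 64)
mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0  # region whose alignment matters clinically
print(mask_guided_loss(warped, fixed, mask))
```

Normalizing the local term by the mask area keeps small structures from being drowned out by the global term, which is the point of the mask guidance.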
https://arxiv.org/abs/2405.10068
In recent years, considerable research on LiDAR semantic segmentation has been conducted, introducing several new state-of-the-art models. However, most research focuses on single-scan point clouds, which limits performance, especially in long-distance outdoor scenarios, by omitting time-sequential information. Moreover, varying density and occlusions constitute significant challenges in single-scan approaches. In this paper we propose a LiDAR point cloud preprocessing and postprocessing method. This multi-stage approach, in conjunction with state-of-the-art models in a multi-scan setting, aims to solve those challenges. We demonstrate the benefits of our method through quantitative evaluation against the same models in single-scan settings. In particular, we achieve significant improvements in mIoU performance of over 5 percentage points at medium range and over 10 percentage points at far range. This is essential for 3D semantic scene understanding at long distances, as well as for applications where offline processing is permissible.
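Multi-scan preprocessing presumably aggregates consecutive scans into a common frame via ego-poses before a model is run. A minimal sketch of that aggregation, assuming 4×4 sensor-to-world pose matrices per scan (the paper's actual pipeline may differ):

```python
import numpy as np

def aggregate_scans(scans, poses):
    """Merge time-sequential LiDAR scans into the frame of the last scan.

    scans: list of (N_i, 3) point arrays in their own sensor frames.
    poses: list of 4x4 sensor-to-world matrices, one per scan.
    """
    to_last = np.linalg.inv(poses[-1])  # world -> last-scan frame
    merged = []
    for pts, pose in zip(scans, poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])
        merged.append((homo @ (to_last @ pose).T)[:, :3])
    return np.vstack(merged)

# Toy example: two scans, second sensor shifted 2 m along x
scan0, scan1 = np.random.rand(100, 3), np.random.rand(120, 3)
pose0, pose1 = np.eye(4), np.eye(4)
pose1[0, 3] = 2.0
dense = aggregate_scans([scan0, scan1], [pose0, pose1])
print(dense.shape)  # (220, 3): denser cloud, helping sparse far-range regions
```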
https://arxiv.org/abs/2405.10046
The accelerated progress of artificial intelligence (AI) has popularized deep learning models across domains, yet their inherent opacity poses challenges, notably in critical fields like healthcare, medicine, and the geosciences. Explainable AI (XAI) has emerged to shed light on these "black box" models, helping decipher their decision-making process. Nevertheless, different XAI methods yield highly different explanations. This inter-method variability increases uncertainty and lowers trust in deep networks' predictions. In this study, for the first time, we propose a novel framework designed to enhance the explainability of deep networks by maximizing both the accuracy and the comprehensibility of the explanations. Our framework integrates explanations from established XAI methods and employs a non-linear "explanation optimizer" to construct a unique and optimal explanation. Through experiments on multi-class and binary classification tasks in 2D object and 3D neuroscience imaging, we validate the efficacy of our approach. Our explanation optimizer achieved superior faithfulness scores, averaging 155% and 63% higher than the best-performing XAI method in the 3D and 2D applications, respectively. Additionally, our approach yielded lower complexity, increasing comprehensibility. Our results suggest that optimal explanations based on specific criteria are derivable and address the issue of inter-method variability in the current XAI literature.
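The authors' non-linear optimizer is not specified in the abstract; a minimal stand-in for the idea searches convex combinations of candidate saliency maps against a user-supplied faithfulness score. Everything below, including the placeholder faithfulness function, is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def faithfulness(expl: np.ndarray) -> float:
    """Placeholder score; a real one would, e.g., measure the model's output
    drop when the pixels ranked most important by `expl` are removed."""
    target = np.linspace(0, 1, expl.size).reshape(expl.shape)
    return -np.mean((expl - target) ** 2)

def optimize_explanation(maps: np.ndarray, n_trials: int = 2000):
    """Random search over convex weights of candidate XAI maps (E, H, W)."""
    best_w, best_s = None, -np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(maps)))  # random point on the simplex
        s = faithfulness(np.tensordot(w, maps, axes=1))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

maps = rng.random((4, 8, 8))  # e.g., Grad-CAM, LIME, SHAP, IG outputs
w, score = optimize_explanation(maps)
print(w, score)
```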
https://arxiv.org/abs/2405.10008
The maturity classification of specialty crops such as strawberries and tomatoes is an essential agricultural downstream activity for selective harvesting and quality control (QC) at production and packaging sites. Recent advancements in Deep Learning (DL) have produced encouraging results for maturity classification from color images. However, hyperspectral imaging (HSI) outperforms methods based on color vision. Multivariate analysis methods and Convolutional Neural Networks (CNN) deliver promising results; however, the large amount of input data and the associated preprocessing requirements hinder practical application. Conventionally, the reflectance intensity across a given electromagnetic spectrum is employed to estimate fruit maturity. We present a feature extraction method and empirically demonstrate that the peak reflectance within the 500-670 nm pigment band and the wavelength of that peak, and conversely the trough reflectance and its corresponding wavelength within the 671-790 nm chlorophyll band, are convenient to compute yet distinctive features for maturity classification. The proposed feature selection method is beneficial because preprocessing, such as dimensionality reduction, is avoided before every prediction. The feature set is designed to capture these traits. The best SOTA methods, among 3D-CNN, 1D-CNN, and SVM, achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our dataset. Results show that the proposed method outperforms the SOTA, as it yields an accuracy above 98.0% in strawberry and 96.0% in tomato classification. A comparative analysis of the time efficiency of these methods shows that the proposed method performs prediction at 13 frames per second (FPS), compared to the maximum 1.16 FPS attained by the full-spectrum SVM classifier.
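The four features described are straightforward to compute from a per-pixel spectrum; a minimal sketch (the wavelength grid and data are illustrative, not the paper's):

```python
import numpy as np

def maturity_features(wavelengths: np.ndarray, reflectance: np.ndarray) -> np.ndarray:
    """Extract [peak_R, peak_wl, trough_R, trough_wl] from one spectrum."""
    pig = (wavelengths >= 500) & (wavelengths <= 670)  # pigment band
    chl = (wavelengths >= 671) & (wavelengths <= 790)  # chlorophyll band
    i_pk = np.argmax(reflectance[pig])
    i_tr = np.argmin(reflectance[chl])
    return np.array([
        reflectance[pig][i_pk], wavelengths[pig][i_pk],  # peak + its wavelength
        reflectance[chl][i_tr], wavelengths[chl][i_tr],  # trough + its wavelength
    ])

# Toy spectrum sampled every 5 nm from 400 to 1000 nm
wl = np.arange(400, 1000, 5.0)
spec = np.random.rand(len(wl))
print(maturity_features(wl, spec))  # 4-dim feature vector, no PCA needed
```

A four-value feature vector like this explains the reported speed advantage: the classifier never sees the full spectrum, so no dimensionality reduction is needed per prediction.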
https://arxiv.org/abs/2405.09955
Infrared physical adversarial examples are of great significance for studying the security of infrared AI systems that are widely used in our lives, such as in autonomous driving. Previous infrared physical attacks mainly focused on 2D infrared pedestrian detection, which may not fully manifest their destructiveness to AI systems. In this work, we propose a physical attack method against infrared detectors based on 3D modeling, which is applied to a real car. The goal is to design a set of infrared adversarial stickers that make cars invisible to infrared detectors at various viewing angles, distances, and scenes. We build a 3D infrared car model with real infrared characteristics and propose an infrared adversarial pattern generation method based on 3D mesh shadows. We propose a 3D control-point-based mesh smoothing algorithm and use a set of smoothness loss functions to enhance the smoothness of adversarial meshes and facilitate the sticker implementation. Besides, we designed aluminum stickers and conducted physical experiments on two real Mercedes-Benz A200L cars. Our adversarial stickers hid the cars from Faster RCNN, an object detector, at various viewing angles, distances, and scenes. The attack success rate (ASR) was 91.49% for real cars. In comparison, the ASRs of random stickers and no sticker were only 6.21% and 0.66%, respectively. In addition, the ASRs of the designed stickers against six unseen object detectors, such as YOLOv3 and Deformable DETR, were between 73.35% and 95.80%, showing good transferability of the attack performance across detectors.
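A common form of mesh smoothness loss, likely related to the one used here, penalizes each vertex's deviation from the centroid of its neighbors (a uniform Laplacian). A minimal sketch of that term, not the paper's exact control-point formulation:

```python
import numpy as np

def laplacian_smoothness(verts: np.ndarray, neighbors: list) -> float:
    """Mean squared distance of each vertex from its neighbors' centroid.

    verts: (V, 3) vertex positions; neighbors[i]: indices adjacent to vertex i.
    """
    loss = 0.0
    for i, nbrs in enumerate(neighbors):
        centroid = verts[list(nbrs)].mean(axis=0)
        loss += np.sum((verts[i] - centroid) ** 2)
    return loss / len(verts)

# Toy example: a patch of 4 vertices, each adjacent to the other 3
verts = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0.5]])
neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
print(laplacian_smoothness(verts, neighbors))
```

Minimizing such a term during pattern generation keeps the adversarial mesh gently curved, which is what makes fabricating and applying physical stickers feasible.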
https://arxiv.org/abs/2405.09924
We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset achieves a surprising 21.13M 3D annotations within 64,000 $m^2$. To relieve the expensive costs of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. The tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin on the validation set, without extra computational overhead. Our dataset and devkit will be made available at \url{this https URL}.
https://arxiv.org/abs/2405.09883
3D face registration is an important process in which a 3D face model is aligned and mapped to a template face. However, the task becomes particularly challenging when dealing with partial face data, where only limited facial information is available. To address this challenge, this paper presents a novel deep learning-based approach that combines quasi-conformal geometry with deep neural networks for partial face registration. The proposed framework begins with a Landmark Detection Network that utilizes curvature information to detect the presence of facial features and estimate their corresponding coordinates. These facial landmark features serve as essential guidance for the registration process. To establish a dense correspondence between the partial face and the template surface, a registration network based on quasi-conformal theories is employed. The registration network establishes a bijective quasi-conformal surface mapping aligning corresponding partial faces based on the detected landmarks and curvature values. It consists of the Coefficients Prediction Network, which outputs the optimal Beltrami coefficient representing the surface mapping. The Beltrami coefficient quantifies the local geometric distortion of the mapping. By controlling the magnitude of the Beltrami coefficient through a suitable activation function, the bijectivity and geometric distortion of the mapping can be controlled. The Beltrami coefficient is then fed into the Beltrami solver network to reconstruct the corresponding mapping. The surface registration yields corresponding regions and point-wise correspondences between different partial faces, facilitating precise shape comparison through the evaluation of point-wise geometric differences at these corresponding regions. Experimental results demonstrate the effectiveness of the proposed method.
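A quasi-conformal map is bijective when its Beltrami coefficient satisfies |μ| < 1 everywhere, so the network's raw output can be squashed with a suitable activation. A minimal sketch of that constraint, assuming a tanh squashing (the paper's activation choice may differ):

```python
import numpy as np

def constrain_beltrami(mu_raw: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale a complex Beltrami field so that |mu| < 1 (bijectivity).

    mu_raw: (H, W) complex array predicted by the coefficients network.
    tanh maps magnitudes [0, inf) -> [0, 1) while preserving direction.
    """
    mag = np.abs(mu_raw)
    return np.tanh(mag) * mu_raw / (mag + eps)

# Toy field: raw magnitudes may exceed 1 before the activation
mu_raw = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
mu = constrain_beltrami(mu_raw)
print(np.abs(mu).max() < 1.0)  # True: distortion bounded, mapping bijective
```

Because the squashing is smooth and monotone, the same mechanism also gives a direct handle on geometric distortion: shrinking the activation's output range tightens the conformality of the learned map.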
https://arxiv.org/abs/2405.09880
We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from text in only $1$ minute. The key component is a dual-mode multi-view latent diffusion model. Given noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model to circumvent the expensive cost of training from scratch. To overcome the high rendering cost during inference, we propose a dual-mode toggling inference strategy that uses the 3D mode for only $1/10$ of the denoising steps, successfully generating a 3D asset in just $10$ seconds without sacrificing quality. The texture of the 3D asset can be further enhanced by our efficient texture refinement process in a short time. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time. Our project page is available at this https URL.
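The toggling strategy runs the expensive rendering-based 3D mode on only a fraction of steps and the cheap 2D mode on the rest. A schematic sketch with stand-in denoisers (in the real model both modes share a tuned latent diffusion network):

```python
import numpy as np

def denoise_2d(latents, t):
    """Stand-in for the cheap single-network 2D denoiser."""
    return latents * 0.95

def denoise_3d(latents, t):
    """Stand-in for the costly render-based 3D-consistent denoiser."""
    return latents * 0.9

def dual_mode_sampling(latents, steps=50, toggle_every=10):
    """Use the 3D mode on ~1/10 of the steps, the 2D mode otherwise."""
    for t in range(steps, 0, -1):
        if t % toggle_every == 0:  # occasional 3D-consistent step
            latents = denoise_3d(latents, t)
        else:                      # fast 2D steps in between
            latents = denoise_2d(latents, t)
    return latents

views = np.random.randn(4, 8, 32, 32)  # toy multi-view latents
out = dual_mode_sampling(views)
print(out.shape)
```

The periodic 3D steps are what keep the views mutually consistent; the 2D steps merely refine each view, which is why skipping most 3D steps costs little quality.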
https://arxiv.org/abs/2405.09874
Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Extracting rich multi-scale features is crucial for such detectors because of the significant differences in size between different types of objects. However, due to real-time requirements, large convolution kernels are rarely used to extract large-scale features in the backbone. Current 3D detectors commonly use feature pyramid networks to obtain large-scale features; however, objects containing fewer points are further lost during downsampling, resulting in degraded performance. Since pillar-based schemes require much less computation than voxel-based schemes, they are more suitable for constructing real-time 3D detectors. Hence, we propose PillarNeXt, a pillar-based scheme. We redesigned the feature encoding, the backbone, and the neck of the 3D detector. We propose Voxel2Pillar feature encoding, which uses a sparse convolution constructor to build pillars with richer point cloud features, especially height features. Moreover, additional learnable parameters are added, enabling the initial pillars to achieve higher performance. We extract multi-scale and large-scale features in the proposed fully sparse backbone, which does not use large convolutional kernels; the backbone consists of the proposed multi-scale feature extraction modules. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves performance. The effectiveness of the proposed PillarNeXt is validated on the Waymo Open Dataset, where object detection accuracy for vehicles, pedestrians, and cyclists is improved; we also verify the effectiveness of each proposed module in detail.
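For context, pillar-based encoders discretize the x-y plane and aggregate the points falling into each cell. Voxel2Pillar's sparse-convolution construction is not detailed in the abstract, so the sketch below shows only the standard pillarization step it builds on:

```python
import numpy as np

def pillarize(points: np.ndarray, cell: float = 0.2, feat_dim: int = 4):
    """Scatter an (N, feat_dim) point cloud (x, y, z, intensity) into
    x-y pillars and mean-pool the features per pillar."""
    ij = np.floor(points[:, :2] / cell).astype(np.int64)
    keys, inv = np.unique(ij, axis=0, return_inverse=True)
    inv = inv.ravel()                           # guard against shape quirks
    feats = np.zeros((len(keys), feat_dim))
    counts = np.bincount(inv, minlength=len(keys))[:, None]
    np.add.at(feats, inv, points)               # sum features per pillar
    return keys, feats / counts                 # pillar coords, mean features

pts = np.random.rand(1000, 4) * [40, 40, 4, 1]  # toy LiDAR points
coords, feats = pillarize(pts)
print(coords.shape, feats.shape)
```

Mean-pooling like this discards the per-pillar height distribution, which is precisely the information Voxel2Pillar is designed to preserve.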
https://arxiv.org/abs/2405.09828
Diffusion models have recently gained significant traction due to their ability to generate high-fidelity and diverse images and videos conditioned on text prompts. In medicine, this application promises to address the critical challenge of data scarcity, a consequence of barriers in data sharing, stringent patient privacy regulations, and disparities in patient population and demographics. By generating realistic and varying medical 2D and 3D images, these models offer a rich, privacy-respecting resource for algorithmic training and research. To this end, we introduce MediSyn, a pair of instruction-tuned text-guided latent diffusion models with the ability to generate high-fidelity and diverse medical 2D and 3D images across specialties and modalities. Through established metrics, we show significant improvement in broad medical image and video synthesis guided by text prompts.
https://arxiv.org/abs/2405.09806
3D cameras have emerged as a critical source of information for applications in robotics and autonomous driving. These cameras provide robots with the ability to capture and utilize point clouds, enabling them to navigate their surroundings and avoid collisions with other objects. However, current standard camera evaluation metrics often fail to consider the specific application context. These metrics typically focus on measures like Chamfer distance (CD) or Earth Mover's Distance (EMD), which may not directly translate to performance in real-world scenarios. To address this limitation, we propose a novel metric for point cloud evaluation, specifically designed to assess the suitability of 3D cameras for the critical task of collision avoidance. This metric incorporates application-specific considerations and provides a more accurate measure of a camera's effectiveness in ensuring safe robot navigation.
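For reference, the Chamfer distance the authors argue is application-agnostic is the symmetrized mean of nearest-neighbor distances between two point clouds. A minimal sketch:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

gt = np.random.rand(500, 3)                      # reference scene points
measured = gt + 0.01 * np.random.randn(500, 3)   # noisy camera output
print(chamfer_distance(gt, measured))
```

Note how CD treats every point equally; a collision-avoidance-aware metric of the kind proposed might instead weight errors by their relevance to the robot's path, for example penalizing missing points (phantom free space) more heavily than spurious ones.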
https://arxiv.org/abs/2405.09755
3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions on datasets employed in training and evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers further to explore the potential of open-vocabulary text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
This research reports VascularPilot3D, the first 3D fully autonomous endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials. It reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
https://arxiv.org/abs/2405.09375
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late-interaction re-ranking method inspired by text matching for image retrieval and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions across a wide size range. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
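Late-interaction re-ranking in the text-matching (ColBERT-style) sense scores a query against a candidate by summing, over query vectors, the maximum similarity to any candidate vector. A minimal sketch of that scoring applied to region embeddings; the shapes and two-stage setup here are assumptions, not the paper's exact configuration:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, cand_vecs: np.ndarray) -> float:
    """Late-interaction score: sum over query vectors of their best match.

    query_vecs: (Q, D) region/patch embeddings of the query volume.
    cand_vecs:  (C, D) embeddings of one candidate volume.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    return float((q @ c.T).max(axis=1).sum())  # MaxSim per query vector

# Re-rank a first-stage shortlist of candidate volumes by MaxSim
query = np.random.randn(16, 128)
shortlist = [np.random.randn(20, 128) for _ in range(5)]
ranked = sorted(range(5), key=lambda i: -maxsim_score(query, shortlist[i]))
print(ranked)
```

Keeping one vector per region rather than a single pooled vector is what lets small anatomical structures contribute to the ranking instead of being averaged away.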
https://arxiv.org/abs/2405.09334
In this paper, we present an innovative technique for the path planning of flying robots in a 3D environment in terms of rough mereology. The main goal was to construct an algorithm that generates mereological potential fields in 3-dimensional space. To avoid falling into local minima, we assist the search with a weighted Euclidean distance. Moreover, a search for a path from the start point to the target that avoids obstacles was applied. The environment was created by connecting two cameras working in real time. The Python library OpenCV [1], which recognizes shapes and colors, was responsible for determining the gate and the elements of the world inside the map. The main purpose of this paper is to apply the given results to drones.
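The potential-field idea in sketch form: each free position gets a potential that grows with the (weighted) distance to the target plus a repulsive term near obstacles, and the robot greedily descends it. The weighting and field construction below are simplified stand-ins for the paper's rough-mereology fields:

```python
import numpy as np

GOAL = np.array([9.0, 9.0, 4.0])
OBSTACLES = [np.array([5.0, 5.0, 2.0])]
W = np.array([1.0, 1.0, 2.0])  # weighted Euclidean metric (z moves cost more)

def potential(p: np.ndarray) -> float:
    attract = np.sqrt(np.sum(W * (p - GOAL) ** 2))  # weighted distance to goal
    repel = sum(1.0 / max(np.linalg.norm(p - o), 1e-3) for o in OBSTACLES)
    return attract + 5.0 * repel

def plan(start: np.ndarray, step: float = 0.5, max_iter: int = 200):
    """Greedy descent over the potential field; returns the waypoint list."""
    path, p = [start], start.copy()
    for _ in range(max_iter):
        # probe the 6-neighborhood and move to the lowest-potential neighbor
        moves = step * np.vstack([np.eye(3), -np.eye(3)])
        p = min((p + m for m in moves), key=potential)
        path.append(p)
        if np.linalg.norm(p - GOAL) < step:
            break
    return path

print(len(plan(np.zeros(3))))
```

Reweighting the distance metric reshapes the field's level sets, which is the mechanism the paper leans on to steer the descent away from local minima.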
https://arxiv.org/abs/2405.09282