Brain tumor segmentation is a fundamental step in assessing a patient's cancer progression. However, manual segmentation demands significant expert time to accurately identify tumors in 3D multimodal brain MRI scans. This reliance on manual segmentation makes the process prone to intra- and inter-observer variability. This work proposes a brain tumor segmentation method as part of the BraTS-GoAT challenge, whose task is to automatically segment tumors in brain MRI scans from diverse populations, including adults, pediatric patients, and the underserved population of sub-Saharan Africa. We employ a recent CNN architecture for medical image segmentation, MedNeXt, as our baseline, and apply extensive model ensembling and postprocessing at inference. Our experiments show that our method performs well on the unseen validation set, with an average DSC of 85.54% and HD95 of 27.88. The code is available on this https URL.
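As an illustration of the inference recipe, here is a minimal sketch of probability-averaging ensembling followed by a common BraTS-style postprocessing step (removing small connected components); the function name and the `min_voxels` threshold are assumptions for illustration, not the paper's settings:

```python
import numpy as np
from scipy import ndimage

def ensemble_and_postprocess(prob_maps, min_voxels=100):
    """Average per-model softmax maps, take the argmax, then drop
    connected components smaller than a voxel threshold."""
    mean_prob = np.mean(prob_maps, axis=0)        # (C, D, H, W), one map per model
    seg = np.argmax(mean_prob, axis=0)            # (D, H, W) label volume
    for label in np.unique(seg):
        if label == 0:                            # skip background
            continue
        components, n = ndimage.label(seg == label)
        for c in range(1, n + 1):
            comp = components == c
            if comp.sum() < min_voxels:
                seg[comp] = 0                     # relabel tiny blobs as background
    return seg
```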
https://arxiv.org/abs/2405.02852
3D object detectors for point clouds often rely on a pooling-based PointNet to encode sparse points into grid-like voxels or pillars. In this paper, we identify that this common PointNet design introduces an information bottleneck that limits 3D object detection accuracy and scalability. To address this limitation, we propose PVTransformer: a transformer-based point-to-voxel architecture for 3D detection. Our key idea is to replace the PointNet pooling operation with an attention module, yielding a better point-to-voxel aggregation function. Our design respects the permutation invariance of sparse 3D points while being more expressive than the pooling-based PointNet. Experimental results show that PVTransformer achieves much better performance than the latest 3D object detectors. On the widely used Waymo Open Dataset, PVTransformer achieves a state-of-the-art 76.5 mAPH L2, outperforming the prior art, SWFormer, by +1.7 mAPH L2.
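A minimal sketch of the core idea, attention-based point-to-voxel aggregation in place of max-pooling; the module and its hyper-parameters are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class AttentionPointToVoxel(nn.Module):
    """One learned query attends over the points inside each voxel, so
    aggregation stays permutation-invariant yet is more expressive than
    a pooling-based PointNet."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, point_feats, padding_mask=None):
        # point_feats: (num_voxels, max_points, dim); padding_mask marks padded slots
        q = self.query.expand(point_feats.size(0), -1, -1)
        voxel_feat, _ = self.attn(q, point_feats, point_feats,
                                  key_padding_mask=padding_mask)
        return voxel_feat.squeeze(1)              # (num_voxels, dim)
```

Because attention weights are computed per point and summed, the result is invariant to point ordering, just like pooling, but the aggregation is content-dependent rather than a fixed max.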
https://arxiv.org/abs/2405.02811
We attempt to use a convolutional neural network (CNN) to perform kinematic analysis of plane bar structures. Using the 3dsMax animation software and the OpenCV module, we build our own image dataset of geometrically stable and geometrically unstable systems, and we construct and train a CNN model on the TensorFlow and Keras deep learning frameworks. The model achieves 100% accuracy on the training, validation, and test sets, and 93.7% accuracy on an additional test set, indicating that a CNN can learn and master the relevant knowledge of kinematic analysis in structural mechanics. In the future, the model's generalization ability can be improved by diversifying the dataset, with the potential to surpass human experts on complex structures. CNNs thus have practical value in the field of kinematic analysis of structural mechanics. Using visualization techniques, we reveal how the CNN learns and recognizes structural features. Using a pre-trained VGG16 model for feature extraction and fine-tuning, we found its generalization ability inferior to that of the self-built model.
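Since the abstract does not spell out the architecture, here is a minimal Keras sketch of the kind of binary classifier it describes (stable vs. unstable); the layer sizes and input resolution are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(input_shape=(150, 150, 3)):
    """Small CNN that labels a structure image as geometrically
    stable (0) or unstable (1)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),    # binary output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```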
https://arxiv.org/abs/2405.02807
A transformer-based deep learning model, MR-Transformer, was developed for total knee replacement (TKR) prediction using magnetic resonance imaging (MRI). The model incorporates ImageNet pre-training and captures three-dimensional (3D) spatial correlations from the MR images. Its performance was compared with that of existing state-of-the-art deep learning models for knee injury diagnosis using MRI. Knee MR scans with four different tissue contrasts from the Osteoarthritis Initiative and Multicenter Osteoarthritis Study databases were used in the study. Experimental results demonstrated the state-of-the-art performance of the proposed model on TKR prediction from MRI.
https://arxiv.org/abs/2405.02784
The perception of the 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general, large motions, we contend that the instantaneous detection and quantification of subtle motions is equally important, as they indicate nuances in driving behavior that may be safety-critical, such as behavior near a stop sign or in parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse LiDAR point clouds, static objects might appear to be moving - the so-called swimming effect. This intertwines with true object motion, creating ambiguity in accurate estimation, especially for subtle motions. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue and mitigate the impact of swimming artifacts. The occupancy completion is learned end-to-end together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motions.
https://arxiv.org/abs/2405.02781
Neuron reconstruction, one of the fundamental tasks in neuroscience, rebuilds neuronal morphology from 3D light microscope imaging data. It plays a critical role in analyzing the structure-function relationship of neurons in the nervous system. However, due to the scarcity of neuron datasets and high-quality SWC annotations, it is still challenging to develop robust segmentation methods for single neuron reconstruction. To address this limitation, we aim to distill the consensus knowledge from massive natural image data to aid the segmentation model in learning the complex neuron structures. Specifically, in this work, we propose a novel training paradigm that leverages a 2D Vision Transformer model pre-trained on large-scale natural images to initialize our Transformer-based 3D neuron segmentation model with a tailored 2D-to-3D weight transferring strategy. Our method builds a knowledge-sharing connection between the abundant natural and the scarce neuron image domains to improve the 3D neuron segmentation ability in a data-efficient manner. Evaluated on a popular benchmark, BigNeuron, our method enhances neuron segmentation performance by 8.71% over the model trained from scratch with the same amount of training samples.
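The paper's tailored transfer strategy is its own; for intuition, here is one standard way to inflate a 2D patch-embedding kernel into a 3D one (I3D-style replication with rescaling), which is an assumption, not necessarily the authors' scheme:

```python
import torch

def inflate_patch_embed(weight_2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Inflate a 2D ViT patch-embedding kernel (out, in, kh, kw) to 3D
    (out, in, kd, kh, kw) by replicating it along depth and dividing by
    kd, so the response to a depth-constant input is preserved."""
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, depth, 1, 1)
    return weight_3d / depth
```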
https://arxiv.org/abs/2405.02686
In recent years, autonomous driving has garnered escalating attention for its potential to relieve drivers' burdens and improve driving safety. Vision-based 3D occupancy prediction, which predicts the spatial occupancy status and semantics of the 3D voxel grid around the autonomous vehicle from image inputs, is an emerging perception task well suited to a cost-effective perception system for autonomous driving. Although numerous studies have demonstrated the advantages of 3D occupancy prediction over object-centric perception tasks, there is still no dedicated review of this rapidly developing field. In this paper, we first introduce the background of vision-based 3D occupancy prediction and discuss the challenges of the task. Second, we conduct a comprehensive survey of progress in vision-based 3D occupancy prediction from three aspects: feature enhancement, deployment friendliness, and label efficiency, and provide an in-depth analysis of the potential and challenges of each category of methods. Finally, we summarize prevailing research trends and propose some inspiring future outlooks. To provide a valuable reference for researchers, a regularly updated collection of related papers, datasets, and code is organized at this https URL.
https://arxiv.org/abs/2405.02595
Active learning in 3D scene reconstruction has been widely studied, as selecting informative training views is critical for the reconstruction. Recently, Neural Radiance Fields (NeRF) variants have shown performance increases in active 3D reconstruction using image rendering or geometric uncertainty. However, the simultaneous consideration of both uncertainties in selecting informative views remains unexplored, while utilizing different types of uncertainty can reduce the bias that arises in the early training stage with sparse inputs. In this paper, we propose ActiveNeuS, which evaluates candidate views considering both uncertainties. ActiveNeuS provides a way to accumulate image rendering uncertainty while avoiding the bias that the estimated densities can introduce. ActiveNeuS computes the neural implicit surface uncertainty, providing the color uncertainty along with the surface information. It efficiently handles the bias by using the surface information and a grid, enabling the fast selection of diverse viewpoints. Our method outperforms previous works on popular datasets, Blender and DTU, showing that the views selected by ActiveNeuS significantly improve performance.
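For intuition only, a minimal sketch of scoring candidate views by both uncertainty types; the weighted sum, `alpha`, and the function interface are assumptions (ActiveNeuS additionally uses surface information and a grid to keep the selected viewpoints diverse, which is omitted here):

```python
import numpy as np

def rank_candidate_views(render_unc, surface_unc, k, alpha=0.5):
    """Combine per-view image-rendering uncertainty and neural implicit
    surface uncertainty, then return the indices of the k best views."""
    score = alpha * render_unc + (1.0 - alpha) * surface_unc
    return np.argsort(score)[::-1][:k]   # highest combined uncertainty first
```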
https://arxiv.org/abs/2405.02568
Currently, foundation models, represented by large language models, have made dramatic progress and are used in a very wide range of domains, including 2D and 3D vision. As one of the important application domains of foundation models, earth observation has attracted attention and various approaches have been developed. When earth observation is considered a single image capture, the imagery can be processed as an image with three or more channels; when multiple captures of different timestamps exist at one location, the temporal observations can be treated as a sequence of continuous images resembling video frames or medical scan slices. This paper presents Spatio-Temporal SwinMAE (ST-SwinMAE), an architecture that focuses on representation learning for spatio-temporal image processing. Specifically, it uses a hierarchical Masked Auto-encoder (MAE) with Video Swin Transformer blocks. With this architecture, we present a pretrained model named Degas 100M as a geospatial foundation model. We also propose an approach for transfer learning with Degas 100M in which both the pretrained encoder and decoder of the MAE are utilized, with skip connections added between them for multi-scale information communication, forming an architecture named Spatio-Temporal SwinUNet (ST-SwinUNet). Our approach shows significant performance improvements over existing state-of-the-art foundation models. Specifically, for transfer learning on the land cover downstream task of the PhilEO Bench dataset, it shows 10.4% higher accuracy on average than other geospatial foundation models.
https://arxiv.org/abs/2405.02512
Magnetic resonance imaging (MRI) and positron emission tomography (PET) are increasingly used in multimodal analysis of neurodegenerative disorders. While MRI is broadly utilized in clinical settings, PET is less accessible. Many studies have attempted to use deep generative models to synthesize PET from MRI scans. However, they often suffer from unstable training and inadequately preserve the brain functional information conveyed by PET. To this end, we propose a functional imaging constrained diffusion (FICD) framework for 3D brain PET image synthesis with paired structural MRI as the input condition, built on a new constrained diffusion model (CDM). FICD introduces noise to PET and then progressively removes it with the CDM, ensuring high output fidelity throughout a stable training phase. The CDM learns to predict denoised PET with a functional imaging constraint introduced to ensure voxel-wise alignment between each denoised PET and its ground truth. Quantitative and qualitative analyses conducted on 293 subjects with paired T1-weighted MRI and 18F-fluorodeoxyglucose (FDG)-PET scans suggest that FICD achieves superior performance in generating FDG-PET data compared to state-of-the-art methods. We further validate the effectiveness of the proposed FICD on data from a total of 1,262 subjects through three downstream tasks, with experimental results suggesting its utility and generalizability.
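A minimal sketch of one training step under these ideas, assuming a hypothetical `model(noisy_pet, mri, t)` that predicts the clean PET and a precomputed cumulative noise schedule `alphas_bar`; the L1 form of the voxel-wise constraint is also an assumption:

```python
import torch
import torch.nn.functional as F

def cdm_training_step(model, pet, mri, alphas_bar):
    """Noise the PET volume at a random timestep, condition the network
    on the paired MRI, and enforce voxel-wise alignment between the
    predicted denoised PET and the ground truth."""
    b = pet.size(0)
    t = torch.randint(0, len(alphas_bar), (b,), device=pet.device)
    a = alphas_bar[t].view(b, 1, 1, 1, 1)     # (B, C, D, H, W) volumes
    noise = torch.randn_like(pet)
    noisy_pet = a.sqrt() * pet + (1 - a).sqrt() * noise
    pet_pred = model(noisy_pet, mri, t)       # predicts the clean PET
    return F.l1_loss(pet_pred, pet)           # voxel-wise constraint
```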
https://arxiv.org/abs/2405.02504
Despite significant advancements in Neural Radiance Fields (NeRFs), the renderings may still suffer from aliasing and blurring artifacts, since it remains a fundamental challenge to effectively and efficiently characterize the anisotropic areas induced by the cone-casting procedure. This paper introduces a Ripmap-Encoded Platonic Solid representation to precisely and efficiently featurize 3D anisotropic areas, achieving high-fidelity anti-aliased renderings. Central to our approach are two key components: Platonic Solid Projection and Ripmap encoding. The Platonic Solid Projection factorizes 3D space onto the unparalleled faces of a certain Platonic solid, such that anisotropic 3D areas can be projected onto planes with distinguishable characterization. Meanwhile, each face of the Platonic solid is encoded by the Ripmap encoding, which is constructed by anisotropically pre-filtering a learnable feature grid, enabling the projected anisotropic areas to be featurized both precisely and efficiently via anisotropic area-sampling. Extensive experiments on both well-established synthetic datasets and a newly captured real-world dataset demonstrate that our Rip-NeRF attains state-of-the-art rendering quality, particularly excelling in the fine details of repetitive structures and textures, while maintaining relatively swift training times.
https://arxiv.org/abs/2405.02386
Existing VLMs can track in-the-wild 2D video objects, while current generative models provide powerful visual priors for synthesizing novel views for the highly under-constrained 2D-to-3D object lifting. Building upon this exciting progress, we present DreamScene4D, the first approach that can generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos with large object motion across occlusions and novel viewpoints. Our key insight is to design a "decompose-then-recompose" scheme to factorize both the whole video scene and each object's 3D motion. We first decompose the video scene by using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is mapped to a set of 3D Gaussians that deform and move in space and time. We also factorize the observed motion into multiple components to handle fast motion. The camera motion can be inferred by re-rendering the background to match the video frames. For the object motion, we first model the object-centric deformation of the objects by leveraging rendering losses and multi-view generative priors in an object-centric frame, then optimize object-centric to world-frame transformations by comparing the rendered outputs against the perceived pixel and optical flow. Finally, we recompose the background and objects and optimize for relative object scales using monocular depth prediction guidance. We show extensive results on the challenging DAVIS and Kubric benchmarks and on self-captured videos, detail some limitations, and provide future directions. Besides 4D scene generation, our results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, while never being explicitly trained to do so.
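To make the last claim concrete, here is a minimal pinhole-projection sketch of turning inferred 3D trajectories into 2D point tracks; the array shapes and variable names are assumptions for illustration:

```python
import numpy as np

def project_trajectories(points_3d, K, world_to_cam):
    """Project per-frame 3D points (T, N, 3) to pixel tracks (T, N, 2)
    using each frame's 4x4 world-to-camera pose and 3x3 intrinsics K."""
    T, N, _ = points_3d.shape
    homog = np.concatenate([points_3d, np.ones((T, N, 1))], axis=-1)
    cam = np.einsum('tij,tnj->tni', world_to_cam, homog)[..., :3]
    pix = np.einsum('ij,tnj->tni', K, cam)
    return pix[..., :2] / pix[..., 2:3]       # perspective divide
```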
https://arxiv.org/abs/2405.02280
We present a novel method for robotic manipulation tasks in human environments that require reasoning about the 3D geometric relationship between a pair of objects. Traditional end-to-end trained policies, which map from pixel observations to low-level robot actions, struggle to reason about complex pose relationships and have difficulty generalizing to unseen object configurations. To address these challenges, we propose a method that learns to reason about the 3D geometric relationship between objects, focusing on the relationship between key parts on one object with respect to key parts on another object. Our standalone model utilizes Weighted SVD to reason about pose relationships both between articulated parts and between free-floating objects. This approach allows the robot to understand, for example, the relationship between an oven door and the oven body, as well as the relationship between a lasagna plate and the oven. By considering the 3D geometric relationship between objects, our method enables robots to perform complex manipulation tasks that reason over object-centric representations. We open-source the code and demonstrate the results here.
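Weighted SVD here refers to the classic weighted Procrustes/Kabsch solution for a rigid transform between weighted point sets; a minimal sketch (the per-point weights would come from the learned model, which is assumed away here):

```python
import numpy as np

def weighted_svd_pose(src, dst, w):
    """Find R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2
    via SVD of the weighted cross-covariance (Kabsch algorithm)."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                                  # proper rotation
    t = mu_d - R @ mu_s
    return R, t
```

The sign correction in `S` guards against the SVD returning a reflection instead of a rotation.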
https://arxiv.org/abs/2405.02241
Autonomous locomotion of mobile ground robots in unstructured environments, for tasks such as waypoint navigation or flipper control, requires a sufficiently accurate prediction of the robot-terrain interaction. Heuristics like occupancy grids or traversability maps are widely used but limit the actions available to robots with active flippers, as joint positions are not taken into account. We present a novel iterative geometric method to predict the 3D pose of mobile ground robots with active flippers on uneven ground with high accuracy and online planning capability. This is achieved by utilizing the ability of signed distance fields to represent surfaces with sub-voxel accuracy. The effectiveness of the presented approach is demonstrated on two different tracked robots in simulation and on a real platform. Compared to a tracking system as ground truth, our method predicts the robot position and orientation with an average accuracy of 3.11 cm and 3.91°, outperforming a recent heightmap-based approach. The implementation is made available as an open-source ROS package.
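The sub-voxel accuracy comes from interpolating the signed distance field between grid samples; a minimal sketch of a trilinear SDF query (the dense-grid layout and argument names are assumptions):

```python
import numpy as np

def sdf_at(sdf, origin, voxel_size, p):
    """Trilinearly interpolate a dense SDF grid at world point p,
    yielding continuous (sub-voxel) distance-to-surface values."""
    g = (np.asarray(p, dtype=float) - origin) / voxel_size
    i = np.floor(g).astype(int)               # base voxel index
    f = g - i                                 # fractional offsets in [0, 1)
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wgt = (f[0] if dx else 1 - f[0]) \
                    * (f[1] if dy else 1 - f[1]) \
                    * (f[2] if dz else 1 - f[2])
                val += wgt * sdf[i[0] + dx, i[1] + dy, i[2] + dz]
    return val
```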
https://arxiv.org/abs/2405.02121
The accuracy and robustness of 3D human pose estimation (HPE) are limited by 2D pose detection errors and the ill-posed nature of 2D-to-3D lifting, which has drawn great attention to multi-hypothesis HPE (MH-HPE) research. Most existing MH-HPE methods are based on generative models, which are computationally expensive and difficult to train. In this study, we propose a Probabilistic Restoration 3D Human Pose Estimation framework (PRPose) that can be integrated with any lightweight single-hypothesis model. Specifically, PRPose employs a weakly supervised approach to fit the hidden probability distribution of the 2D-to-3D lifting process in a single-hypothesis HPE model, and then reverse-maps the distribution to the 2D pose input through an adaptive noise sampling strategy to effectively generate reasonable multi-hypothesis samples. Extensive experiments on 3D HPE benchmarks (Human3.6M and MPI-INF-3DHP) highlight the effectiveness and efficiency of PRPose. Code is available at: this https URL.
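A minimal sketch of the sampling idea: perturb the 2D input with per-joint noise scales and re-run a single-hypothesis lifter once per sample. The `lifter` interface and the assumption that the fitted scales `sigma` are given are both illustrative:

```python
import torch

def multi_hypothesis(lifter, pose_2d, sigma, n_samples=10):
    """Resample the 2D pose (J, 2) with per-joint noise scales sigma
    (J, 2) and lift every sample, giving (n_samples, J, 3) hypotheses."""
    noise = torch.randn(n_samples, *pose_2d.shape)
    samples = pose_2d.unsqueeze(0) + sigma.unsqueeze(0) * noise
    return torch.stack([lifter(s) for s in samples])
```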
https://arxiv.org/abs/2405.02114
Motivation: Alzheimer's Disease hallmarks include amyloid-beta deposits and brain atrophy, detectable via PET and MRI scans, respectively. PET is expensive, invasive, and exposes patients to ionizing radiation. MRI is cheaper, non-invasive, and free from ionizing radiation, but is limited to measuring brain atrophy. Goal: To develop a 3D image translation model that synthesizes amyloid-beta PET images from T1-weighted MRI, exploiting the known relationship between amyloid-beta and brain atrophy. Approach: The model was trained on 616 PET/MRI pairs and validated with 264 pairs. Results: The model synthesized amyloid-beta PET images from T1-weighted MRI with a high degree of similarity, showing high SSIM and PSNR metrics (SSIM > 0.95, PSNR = 28). Impact: Our model proves the feasibility of synthesizing amyloid-beta PET images from structural MRI, significantly enhancing accessibility for large-cohort studies and early dementia detection, while also reducing cost, invasiveness, and radiation exposure.
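For reference, a minimal sketch of computing the two reported metrics on a synthesized vs. real PET volume with scikit-image; the `evaluate` wrapper is an illustrative assumption:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(synth: np.ndarray, real: np.ndarray):
    """Return (PSNR, SSIM) between a synthesized and a real 3D volume."""
    rng = real.max() - real.min()
    psnr = peak_signal_noise_ratio(real, synth, data_range=rng)
    ssim = structural_similarity(real, synth, data_range=rng)
    return psnr, ssim
```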
https://arxiv.org/abs/2405.02109
With the wide application of knowledge distillation between an ImageNet pre-trained teacher model and a learnable student model, industrial anomaly detection has witnessed significant achievements in the past few years. The success of knowledge distillation relies mainly on maintaining the feature discrepancy between the teacher and student models, under the assumptions that: (1) the teacher model can jointly represent two different distributions, for the normal and abnormal patterns, while (2) the student model can only reconstruct the normal distribution. However, maintaining these ideal assumptions in practice remains challenging. In this paper, we propose a simple yet effective two-stage industrial anomaly detection framework, termed AAND, which sequentially performs Anomaly Amplification and Normality Distillation to obtain a robust feature discrepancy. In the first, anomaly amplification stage, we propose a novel Residual Anomaly Amplification (RAA) module to advance the pre-trained teacher encoder. With exposure to synthetic anomalies, it amplifies anomalies via residual generation while maintaining the integrity of the pre-trained model. It mainly comprises a Matching-guided Residual Gate and an Attribute-scaling Residual Generator, which determine the residuals' proportion and characteristics, respectively. In the second, normality distillation stage, we further employ a reverse distillation paradigm to train a student decoder, in which a novel Hard Knowledge Distillation (HKD) loss is built to better facilitate the reconstruction of normal patterns. Comprehensive experiments on the MvTecAD, VisA, and MvTec3D-RGB datasets show that our method achieves state-of-the-art performance.
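At inference, teacher-student frameworks of this kind typically read out anomalies as the feature discrepancy between the two networks; a minimal sketch of that common readout (not AAND's full RAA/HKD pipeline):

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feats, student_feats, out_size):
    """Score anomalies as 1 - cosine similarity between paired teacher
    and student feature maps, upsampled and summed over scales."""
    amap = 0
    for ft, fs in zip(teacher_feats, student_feats):   # (B, C, H, W) per scale
        d = 1 - F.cosine_similarity(ft, fs, dim=1)     # (B, H, W)
        amap = amap + F.interpolate(d.unsqueeze(1), size=out_size,
                                    mode="bilinear", align_corners=False)
    return amap.squeeze(1)                             # (B, *out_size)
```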
https://arxiv.org/abs/2405.02068
Advances in Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or only to explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique and combine it with a patch-wise loss to improve rendering quality and bit accuracy with minimal trade-offs. We evaluate our method on three different aspects: the capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed than the compared state-of-the-art methods.
https://arxiv.org/abs/2405.02066
In the fields of photogrammetry, computer vision, and computer graphics, the task of neural 3D scene reconstruction has led to the exploration of various techniques. Among these, 3D Gaussian Splatting stands out for its explicit representation of scenes using 3D Gaussians, making it appealing for tasks like 3D point cloud extraction and surface reconstruction. Motivated by its potential, we address the domain of 3D scene reconstruction, aiming to leverage the capabilities of the Microsoft HoloLens 2 for instant 3D Gaussian Splatting. We present HoloGS, a novel workflow utilizing HoloLens sensor data, which bypasses the need for pre-processing steps like Structure from Motion by instantly accessing the required input data, i.e., the images, camera poses, and the point cloud from depth sensing. We provide comprehensive investigations, including the training process and the rendering quality, assessed through the Peak Signal-to-Noise Ratio, and the geometric 3D accuracy of the densified point cloud from Gaussian centers, measured by Chamfer Distance. We evaluate our approach on two self-captured scenes: an outdoor scene of a cultural heritage statue and an indoor scene of a fine-structured plant. Our results show that the HoloLens data, including RGB images, corresponding camera poses, and depth-sensing-based point clouds to initialize the Gaussians, are suitable as input for 3D Gaussian Splatting.
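For reference, the Chamfer Distance used for the geometric evaluation has a standard symmetric form; a minimal KD-tree sketch (the summed-means variant is an assumption, as the abstract does not state the exact normalization):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds p (N, 3) and
    q (M, 3): mean nearest-neighbour distance in both directions."""
    d_pq, _ = cKDTree(q).query(p)    # for each point in p, nearest in q
    d_qp, _ = cKDTree(p).query(q)    # for each point in q, nearest in p
    return d_pq.mean() + d_qp.mean()
```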
https://arxiv.org/abs/2405.02005
The paper considers the problem of human-scale RF sensing utilizing a network of resource-constrained MIMO radars with low range-azimuth resolution. The radars operate in the mmWave band and obtain time-varying 3D point cloud (PC) information that is sensitive to body movements. They also observe the same scene from different views and cooperate while sensing the environment using a sidelink communication channel. Conventional cooperation setups allow the radars to mutually exchange raw PC information to improve ego sensing. The paper proposes a federation mechanism in which the radars exchange the parameters of a Bayesian posterior measure of the observed PCs, rather than raw data. The radars act as distributed parameter servers to reconstruct a global posterior (i.e., a federated posterior) using Bayesian tools. The paper quantifies and compares the benefits of radar federation with respect to cooperation mechanisms. Both approaches are validated by experiments on a real-time demonstration platform. Federation makes minimal use of the sidelink communication channel (20 to 25 times lower bandwidth use) and is less sensitive to unresolved targets. On the other hand, cooperation reduces the mean absolute target estimation error by about 20%.
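A minimal sketch of fusing exchanged posterior parameters in the Gaussian case, where multiplying per-radar densities means precisions add and the fused mean is the precision-weighted average; the Gaussian form and flat priors are assumptions the abstract does not specify:

```python
import numpy as np

def fuse_gaussian_posteriors(means, covs):
    """Combine per-radar posteriors N(mean_k, cov_k) into one federated
    posterior by a product of densities (precision-weighted fusion)."""
    precisions = [np.linalg.inv(c) for c in covs]
    fused_cov = np.linalg.inv(sum(precisions))
    fused_mean = fused_cov @ sum(P @ m for P, m in zip(precisions, means))
    return fused_mean, fused_cov
```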
https://arxiv.org/abs/2405.01995