The 3D Swin Transformer (3D-ST), known for its hierarchical attention and window-based processing, excels at capturing intricate spatial relationships within images. The Spatial-spectral Transformer (SST), meanwhile, specializes in modeling long-range dependencies through self-attention mechanisms. This paper therefore introduces a novel method: an attentional fusion of these two transformers that significantly enhances the classification performance of Hyperspectral Images (HSIs). What sets this approach apart is its emphasis on integrating the attentional mechanisms of both architectures. This integration not only refines the modeling of spatial and spectral information but also contributes to more precise and accurate classification results. Experimentation and evaluation on benchmark HSI datasets underscore the importance of employing disjoint training, validation, and test samples. The results demonstrate the effectiveness of the fusion approach, showcasing its superiority over traditional methods and the individual transformers. Incorporating disjoint samples enhances the robustness and reliability of the proposed methodology, emphasizing its potential for advancing hyperspectral image classification.
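The abstract leaves the fusion mechanism unspecified; as a rough illustration only, the sketch below (PyTorch) fuses token features from a spatial branch and a spectral branch with cross-attention and a learned per-channel gate. All module names, dimensions, and the gating form are our own placeholders, not the paper's architecture.

```python
# Minimal sketch of attentional fusion of two transformer branches for HSI
# classification. Branch encoders are assumed to exist upstream; only the
# fusion head is shown, and every dimension here is a placeholder.
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    """Fuse spatial (3D-ST-like) and spectral (SST-like) token features."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # Cross-attention: spatial tokens attend to spectral tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Learned gate deciding how much of each branch to keep per channel.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spatial_tokens, spectral_tokens):
        # spatial_tokens, spectral_tokens: (B, N, dim)
        attended, _ = self.cross_attn(spatial_tokens, spectral_tokens, spectral_tokens)
        g = self.gate(torch.cat([spatial_tokens, attended], dim=-1))
        fused = g * spatial_tokens + (1 - g) * attended
        return self.head(fused.mean(dim=1))  # pool tokens, then classify

fusion = AttentionalFusion(dim=64, num_classes=16)
logits = fusion(torch.randn(2, 49, 64), torch.randn(2, 49, 64))
print(logits.shape)  # torch.Size([2, 16])
```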
https://arxiv.org/abs/2405.01095
Reconstructing a hand mesh from a single RGB image is a challenging task because hands are often occluded by objects. Most previous works attempt to introduce additional information and adopt attention mechanisms to improve 3D reconstruction results, but this increases computational complexity. This observation prompts us to propose a new and concise architecture with improved computational efficiency. In this work, we propose a simple and effective 3D hand mesh reconstruction network, HandSSCA, which is the first to incorporate state space modeling into the field of hand pose estimation. In the network, we design a novel state space channel attention module that extends the effective receptive field, extracts hand features in the spatial dimension, and enhances hand regional features in the channel dimension. This design helps to reconstruct a complete and detailed hand mesh. Extensive experiments conducted on well-known datasets featuring challenging hand-object occlusions (such as FREIHAND, DEXYCB, and HO3D) demonstrate that our proposed HandSSCA achieves state-of-the-art performance while maintaining a minimal parameter count.
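The abstract names two ingredients, state space modeling and channel attention, without giving the module's structure. The sketch below is a loose, generic combination of a diagonal linear state-space scan over flattened spatial positions with a squeeze-and-excitation style channel gate; every detail here is an assumption for illustration, not HandSSCA's actual design.

```python
# Generic state-space-scan + channel-attention sketch (PyTorch).
import torch
import torch.nn as nn

class StateSpaceChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Diagonal state-space parameters (one scalar per channel).
        self.a = nn.Parameter(torch.full((channels,), 0.9))
        self.b = nn.Parameter(torch.ones(channels))
        # Squeeze-and-excitation style channel gate.
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, L, C) features flattened over the spatial dimension.
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):          # linear recurrent scan
            h = self.a * h + self.b * x[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1)
        g = self.gate(y.mean(dim=1))         # channel attention from pooled state
        return y * g.unsqueeze(1)

m = StateSpaceChannelAttention(64)
print(m(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```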
https://arxiv.org/abs/2405.01066
Simulating soil reflectance spectra is invaluable for soil-plant radiative modeling and training machine learning models, yet it is difficult because of the intricate relationships between soil structure and its constituents. To address this, a fully data-driven soil optics generative model (SOGM) was developed to simulate soil reflectance spectra from soil property inputs. The model is trained on an extensive dataset comprising nearly 180,000 soil spectra-property pairs from 17 datasets. It generates soil reflectance spectra from text-based inputs describing soil properties and their values, rather than only numerical values and labels in binary vector format. The generative model can simulate output spectra based on an incomplete set of input properties. SOGM is based on the denoising diffusion probabilistic model (DDPM). Two additional sub-models were also built to complement the SOGM: a spectral padding model that can fill in the gaps for spectra shorter than the full visible-near-infrared range (VIS-NIR; 400 to 2499 nm), and a wet soil spectra model that can estimate the effects of water content on soil reflectance spectra given the dry spectrum predicted by the SOGM. The SOGM was up-scaled by coupling with the Helios 3D plant modeling software, which allowed for the generation of synthetic aerial images of simulated soil and plant scenes. It can also be easily integrated with soil-plant radiation models used in remote sensing research, such as PROSAIL. Testing the SOGM on new datasets not included in model training shows that the model can generate reasonable soil reflectance spectra from the available property inputs. The presented models are openly accessible at: this https URL.
https://arxiv.org/abs/2405.01060
This paper presents a novel latent 3D diffusion model for the generation of neural voxel fields, aiming to achieve accurate part-aware structures. Compared to existing methods, there are two key designs to ensure high-quality and accurate part-aware generation. On one hand, we introduce a latent 3D diffusion process for neural voxel fields, enabling generation at significantly higher resolutions that can accurately capture rich textural and geometric details. On the other hand, a part-aware shape decoder is introduced to integrate the part codes into the neural voxel fields, guiding the accurate part decomposition and producing high-quality rendering results. Through extensive experimentation and comparisons with state-of-the-art methods, we evaluate our approach across four different classes of data. The results demonstrate the superior generative capabilities of our proposed method in part-aware shape generation, outperforming existing state-of-the-art methods.
https://arxiv.org/abs/2405.00998
Surgical scene simulation plays a crucial role in surgical education and simulator-based robot learning. Traditional approaches to creating these environments involve a labor-intensive process in which designers hand-craft tissue models with textures and geometries for soft-body simulation. This manual approach is not only time-consuming but also limited in scalability and realism. In contrast, data-driven simulation offers a compelling alternative: it has the potential to automatically reconstruct 3D surgical scenes from real-world surgical video data, followed by the application of soft-body physics. This area, however, is relatively uncharted. In our research, we introduce 3D Gaussians as a learnable representation of the surgical scene, learned from stereo endoscopic video. To prevent over-fitting and ensure the geometrical correctness of these scenes, we incorporate depth supervision and anisotropy regularization into the Gaussian learning process. Furthermore, we apply the Material Point Method, integrated with physical properties, to the 3D Gaussians to achieve realistic scene deformations. Our method was evaluated on in-house and public surgical video datasets. Results show that it can reconstruct and simulate surgical scenes from endoscopic videos efficiently (taking only a few minutes to reconstruct the surgical scene) and produce both visually and physically plausible deformations at a speed approaching real time. These results demonstrate the great potential of our proposed method to enhance the efficiency and variety of simulations available for surgical education and robot learning.
https://arxiv.org/abs/2405.00956
Automatic text-guided 3D avatar generation has recently made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), representing avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: this https URL.
https://arxiv.org/abs/2405.00954
Hyperspectral Imaging (HSI) serves as an important technique in remote sensing. However, high dimensionality and data volume typically pose significant computational challenges. Band selection is essential for reducing spectral redundancy in hyperspectral imagery while retaining intrinsic critical information. In this work, we propose a novel hyperspectral band selection model by decomposing the data into a low-rank and smooth component and a sparse one. In particular, we develop a generalized 3D total variation (G3DTV) by applying the $\ell_1^p$-norm to derivatives to preserve spatial-spectral smoothness. By employing the alternating direction method of multipliers (ADMM), we derive an efficient algorithm, where the tensor low-rankness is implied by the tensor CUR decomposition. We demonstrate the effectiveness of the proposed approach through comparisons with various other state-of-the-art band selection techniques using two benchmark real-world datasets. In addition, we provide practical guidelines for parameter selection in both noise-free and noisy scenarios.
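The abstract specifies the ingredients (a low-rank and smooth component, a sparse component, the $\ell_1^p$-norm on derivatives, ADMM, tensor CUR) but not the exact objective. One plausible way to write the model, in our own notation and weighting rather than the paper's, is

$$\min_{\mathcal{L},\,\mathcal{S}} \; \lambda\,\mathrm{G3DTV}(\mathcal{L}) + \|\mathcal{S}\|_1 \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S}, \qquad \mathrm{G3DTV}(\mathcal{L}) = \sum_{d\in\{x,y,z\}} \big\|\nabla_d \mathcal{L}\big\|_{\ell_1^p},$$

where $\mathcal{X}$ is the HSI tensor, $\nabla_d$ denotes the discrete derivative along the spatial or spectral direction $d$, the low-rankness of $\mathcal{L}$ is enforced implicitly through a tensor CUR decomposition, and ADMM alternates over $\mathcal{L}$, $\mathcal{S}$, and auxiliary derivative variables.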
https://arxiv.org/abs/2405.00951
We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes from scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enabling collaborative information exchange, enhancing controllable and consistent generation that is aware of global constraints. This is achieved through an information echo scheme in both the shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.
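To make the echo scheme concrete, here is a schematic sketch (PyTorch) of one exchange step: each node's denoising latent is shared, averaged over the scene-graph adjacency, and echoed back into its process. The shapes and the simple graph-convolution form are our assumptions, not EchoScene's actual layers.

```python
# One "information echo" step over per-node denoising latents.
import torch

def echo_step(latents, adj, weight):
    # latents: (N, D) per-node denoising states; adj: (N, N) scene-graph
    # adjacency (with self-loops); weight: (D, D) shared projection.
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    mixed = (adj @ latents) / deg                 # average neighbor states
    return latents + torch.tanh(mixed @ weight)   # echo back into each process

N, D = 5, 32
adj = torch.eye(N) + torch.rand(N, N).round()     # toy graph with self-loops
latents = torch.randn(N, D)
w = torch.randn(D, D) * 0.1
latents = echo_step(latents, adj, w)              # applied at every denoising step
```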
https://arxiv.org/abs/2405.00915
Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser sampling at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying the stronger geometric information offered by an explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing densified Lidar points by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.
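A simplified sketch (NumPy) of what occlusion-aware depth supervision from accumulated Lidar could look like: project the aggregated points into a training camera, keep only the nearest return per pixel (a z-buffer, so occluded points never supervise), and penalize rendered depth against the resulting sparse depth map. The projection model and loss are illustrative assumptions, not the paper's exact scheme.

```python
# Occlusion-aware sparse depth supervision from accumulated lidar points.
import numpy as np

def lidar_depth_map(points_cam, K, H, W):
    # points_cam: (N, 3) lidar points in camera coordinates (z forward);
    # K: (3, 3) intrinsics. Returns an (H, W) sparse depth map (inf = empty).
    z = points_cam[:, 2]
    valid = z > 0.1
    uv = points_cam[valid] @ K.T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    depth = np.full((H, W), np.inf)
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inb], v[inb], z[valid][inb]):
        depth[vi, ui] = min(depth[vi, ui], zi)  # z-buffer: nearest point wins
    return depth

def depth_loss(rendered, lidar):
    # L1 penalty only where a lidar return survived the z-buffer.
    mask = np.isfinite(lidar)
    return np.abs(rendered[mask] - lidar[mask]).mean()
```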
https://arxiv.org/abs/2405.00900
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.
https://arxiv.org/abs/2405.00794
Recently, 3D Gaussian Splatting, as a novel 3D representation, has garnered attention for its fast rendering speed and high rendering quality. However, this comes with high memory consumption; e.g., a well-trained Gaussian field may utilize three million Gaussian primitives and over 700 MB of memory. We attribute this high memory footprint to the lack of consideration for the relationships between primitives. In this paper, we propose a memory-efficient Gaussian field named SUNDAE with spectral pruning and neural compensation. On one hand, we construct a graph on the set of Gaussian primitives to model their relationships and design a spectral down-sampling module to prune out primitives while preserving desired signals. On the other hand, to compensate for the quality loss of pruning Gaussians, we exploit a lightweight neural network head to mix splatted features, which effectively compensates for quality losses while capturing the relationships between primitives in its weights. We demonstrate the performance of SUNDAE with extensive results. For example, on the Mip-NeRF360 dataset, SUNDAE achieves 26.80 PSNR at 145 FPS using 104 MB of memory, while the vanilla Gaussian splatting algorithm achieves 25.60 PSNR at 160 FPS using 523 MB. Codes are publicly available at this https URL.
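The abstract does not detail the spectral down-sampling rule. One plausible reading, sketched below (NumPy/SciPy), is to build a kNN graph over Gaussian centers, smooth a per-primitive signal over the graph, and preferentially keep primitives whose signal is poorly explained by their neighbors (high-frequency content). This is only an interpretation for illustration, not SUNDAE's actual sampling module.

```python
# Graph-based pruning of Gaussian primitives by high-frequency residual.
import numpy as np
from scipy.spatial import cKDTree

def prune(centers, opacity, keep_ratio=0.5, k=8):
    tree = cKDTree(centers)
    _, idx = tree.query(centers, k=k + 1)        # k neighbors plus self
    neighbor_mean = opacity[idx[:, 1:]].mean(axis=1)
    residual = np.abs(opacity - neighbor_mean)   # high-frequency content
    n_keep = int(keep_ratio * len(centers))
    return np.argsort(-residual)[:n_keep]        # keep most informative primitives

centers = np.random.rand(10000, 3)   # toy Gaussian centers
opacity = np.random.rand(10000)      # toy per-primitive signal
kept = prune(centers, opacity)
```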
https://arxiv.org/abs/2405.00676
Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
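A hedged sketch of deriving an editing direction in CLIP image-embedding space from two text prompts. The `texture_prior` mapping below is a hypothetical stand-in for the paper's learned, sampling-based texture prior; the CLIP calls use the real Hugging Face transformers API.

```python
# Editing-direction sketch in CLIP embedding space.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_embed(prompt):
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)

def texture_prior(text_feat):
    # Placeholder: the paper learns a mapping from text to CLIP *image*
    # embeddings and samples from it; identity is used here for illustration.
    return text_feat

src = texture_prior(text_embed("aged wood"))
dst = texture_prior(text_embed("new wood"))
direction = dst - src
direction = direction / direction.norm()  # unit editing direction in CLIP space
# The diffusion generation is then conditioned on img_embed + alpha * direction.
```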
https://arxiv.org/abs/2405.00672
Neural Radiance Fields (NeRF) have shown impressive results in 3D reconstruction and generating novel views. A key challenge within NeRF is the editing of reconstructed scenes, such as object removal, which requires maintaining consistency across multiple views and ensuring high-quality synthesised perspectives. Previous studies have incorporated depth priors, typically from LiDAR or sparse depth measurements provided by COLMAP, to improve the performance of object removal in NeRF. However, these methods are either costly or time-consuming. In this paper, we propose a novel approach that integrates monocular depth estimates with NeRF-based object removal models to significantly reduce time consumption and enhance the robustness and quality of scene generation and object removal. We conducted a thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset to verify its accuracy in depth map generation. Our findings suggest that COLMAP can serve as an effective alternative to a ground truth depth map where such information is missing or costly to obtain. Additionally, we integrated various monocular depth estimation methods into the removal NeRF model, i.e., SpinNeRF, to assess their capacity to improve object removal performance. Our experimental results highlight the potential of monocular depth estimation to substantially improve NeRF applications.
https://arxiv.org/abs/2405.00630
We present a novel approach for long-term human trajectory prediction, which is essential for long-horizon robot planning in human-populated environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood (NLL) and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged baselines for a time horizon of 60s.
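As a toy illustration of the continuous-time Markov chain machinery: sample an exponential holding time in the current interaction state, then jump according to the transition rates. The states and generator matrix below are invented for illustration; the paper grounds such interaction sequences into spatio-temporal distributions over human positions.

```python
# Toy continuous-time Markov chain over interaction states.
import numpy as np

rng = np.random.default_rng(0)
states = ["walk", "sit", "use_object"]
Q = np.array([[-0.20,  0.15,  0.05],   # generator matrix: rows sum to zero
              [ 0.10, -0.15,  0.05],
              [ 0.08,  0.02, -0.10]])

def sample_ctmc(s, horizon=60.0):
    t, traj = 0.0, [(0.0, states[s])]
    while True:
        rate = -Q[s, s]
        t += rng.exponential(1.0 / rate)   # exponential holding time in state s
        if t >= horizon:
            return traj
        probs = np.clip(Q[s], 0, None)     # off-diagonal rates -> jump probs
        s = rng.choice(len(states), p=probs / probs.sum())
        traj.append((t, states[s]))

print(sample_ctmc(0))  # e.g. [(0.0, 'walk'), (4.7, 'sit'), ...]
```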
https://arxiv.org/abs/2405.00552
This paper focuses on training a robust RGB-D registration model without ground-truth pose supervision. Existing methods usually adopt a pairwise training strategy based on differentiable rendering, which enforces the photometric and the geometric consistency between the two registered frames as supervision. However, this frame-to-frame framework suffers from poor multi-view consistency due to factors such as lighting changes, geometry occlusion and reflective materials. In this paper, we present NeRF-UR, a novel frame-to-model optimization framework for unsupervised RGB-D registration. Instead of frame-to-frame consistency, we leverage the neural radiance field (NeRF) as a global model of the scene and use the consistency between the input and the NeRF-rerendered frames for pose optimization. This design can significantly improve the robustness in scenarios with poor multi-view consistency and provides better learning signal for the registration model. Furthermore, to bootstrap the NeRF optimization, we create a synthetic dataset, Sim-RGBD, through a photo-realistic simulator to warm up the registration model. By first training the registration model on Sim-RGBD and later unsupervisedly fine-tuning on real data, our framework enables distilling the capability of feature extraction and registration from simulation to reality. Our method outperforms the state-of-the-art counterparts on two popular indoor RGB-D datasets, ScanNet and 3DMatch. Code and models will be released for paper reproduction.
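A bare-bones sketch (PyTorch) of the frame-to-model idea: rather than enforcing consistency between two frames, optimize the pose so the input frame agrees with the NeRF's rerendering. `render_nerf` is a hypothetical stand-in for a differentiable renderer; the parameterization and loss are our simplifications.

```python
# Pose refinement by photometric consistency against a NeRF rerender.
import torch

def refine_pose(frame, pose_init, render_nerf, iters=100, lr=1e-3):
    # pose: 6-vector (axis-angle rotation + translation), optimized directly.
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        rendered = render_nerf(pose)               # differentiable rerendering
        loss = (rendered - frame).abs().mean()     # photometric consistency
        loss.backward()
        opt.step()
    return pose.detach()
```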
https://arxiv.org/abs/2405.00507
Background and purpose: Deformable image registration (DIR) is a crucial tool in radiotherapy for extracting and modelling organ motion. However, when significant changes and sliding boundaries are present, it suffers from compromised accuracy and increased uncertainty, which affect the subsequent contour propagation and dose accumulation procedures. Materials and methods: We propose an implicit neural representation (INR)-based approach that models motion continuously in both space and time, named Continuous-sPatial-Temporal DIR (CPT-DIR). This method uses a multilayer perceptron (MLP) network to map a 3D coordinate (x,y,z) to its corresponding velocity vector (vx,vy,vz). The displacement vectors (dx,dy,dz) are then calculated by integrating the velocity vectors over time. The MLP's parameters can rapidly adapt to new cases without pre-training, enhancing optimisation. The DIR's performance was tested on the DIR-Lab dataset of 10 lung 4DCT cases, using metrics of landmark accuracy (TRE), contour conformity (Dice) and image similarity (MAE). Results: The proposed CPT-DIR reduces landmark TRE from 2.79mm to 0.99mm, outperforming B-splines for all cases. The MAE of the whole-body region improves from 35.46HU to 28.99HU. Furthermore, CPT-DIR surpasses B-splines in accuracy in the sliding boundary region, lowering MAE and increasing Dice coefficients for the ribcage from 65.65HU and 90.41% to 42.04HU and 90.56%, versus 75.40HU and 89.30% without registration. Meanwhile, CPT-DIR offers significant speed advantages, completing in under 15 seconds compared to a few minutes with the conventional B-splines method. Conclusion: Leveraging continuous representations, the CPT-DIR method significantly enhances registration accuracy, automation and speed, outperforming traditional B-splines in landmark and contour precision, particularly in challenging areas.
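The velocity-field formulation described above is straightforward to sketch. Below, a small MLP maps a coordinate (plus time) to a velocity, and the displacement field is obtained by forward-Euler integration; the layer sizes, the explicit time input, and the integrator choice are our assumptions.

```python
# Velocity-field DIR sketch: MLP velocity + Euler integration (PyTorch).
import torch
import torch.nn as nn

velocity_mlp = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(),   # input: (x, y, z, t)
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),              # output: (vx, vy, vz)
)

def displacement(xyz, t_end, steps=16):
    # Integrate dx/dt = v(x, t) from t=0 to t_end with forward Euler.
    x = xyz.clone()
    dt = t_end / steps
    for i in range(steps):
        t = torch.full_like(x[:, :1], i * dt)
        x = x + velocity_mlp(torch.cat([x, t], dim=1)) * dt
    return x - xyz  # (dx, dy, dz) at the sampled points

pts = torch.rand(1024, 3)           # sampled voxel coordinates
dvf = displacement(pts, t_end=1.0)  # displacement vector field
```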
https://arxiv.org/abs/2405.00430
State-of-the-art neural implicit surface representations have achieved impressive results in indoor scene reconstruction by incorporating monocular geometric priors as additional supervision. However, we have observed that multi-view inconsistency between such priors poses a challenge for high-quality reconstructions. In response, we present NC-SDF, a neural signed distance field (SDF) 3D reconstruction framework with view-dependent normal compensation (NC). Specifically, we integrate view-dependent biases in monocular normal priors into the neural implicit representation of the scene. By adaptively learning and correcting the biases, our NC-SDF effectively mitigates the adverse impact of inconsistent supervision, enhancing both the global consistency and local details in the reconstructions. To further refine the details, we introduce an informative pixel sampling strategy to pay more attention to intricate geometry with higher information content. Additionally, we design a hybrid geometry modeling approach to improve the neural implicit representation. Experiments on synthetic and real-world datasets demonstrate that NC-SDF outperforms existing approaches in terms of reconstruction quality.
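The abstract describes learning and correcting view-dependent biases in the monocular normal priors. One rough way to realize this, sketched below (PyTorch), is a small learned per-view rotation applied to the prior normal before comparing it with the SDF normal; the axis-angle parameterization and cosine loss are illustrative assumptions, not NC-SDF's actual formulation.

```python
# Per-view normal compensation sketch: rotate the prior, then compare.
import torch

def rotate(n, axis_angle):
    # Rodrigues' formula applied to a batch of normals n: (B, 3).
    theta = axis_angle.norm().clamp(min=1e-8)
    k = axis_angle / theta
    cos, sin = torch.cos(theta), torch.sin(theta)
    return (n * cos
            + torch.cross(k.expand_as(n), n, dim=1) * sin
            + k * (n * k).sum(dim=1, keepdim=True) * (1 - cos))

# One learnable 3-vector per training view (hypothetical setup, 100 views).
view_bias = torch.zeros(100, 3, requires_grad=True)

def normal_loss(sdf_normal, prior_normal, view_id):
    # Compensate the monocular prior for this view's bias, then align.
    compensated = rotate(prior_normal, view_bias[view_id])
    return (1 - torch.cosine_similarity(sdf_normal, compensated, dim=1)).mean()
```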
https://arxiv.org/abs/2405.00340
In this project, we explored machine learning approaches for predicting hearing loss thresholds from 3D images of the brain's gray matter. We solved the problem in two phases. In the first phase, we used a 3D CNN model to compress the high-dimensional input into a latent space and decode it back to the original image, representing the input in a rich feature space. In the second phase, we utilized this model to reduce the input to rich features and used these features to train standard machine learning models for predicting hearing thresholds. In the first phase, we experimented with autoencoders and variational autoencoders for dimensionality reduction, and we explored random forest, XGBoost and multi-layer perceptron models for regressing the thresholds. We split the given data set into training and testing sets and achieved ranges of 8.80 and 22.57 for PT500 and PT4000 on the test set, respectively. Among these models, the multi-layer perceptron achieved the lowest RMSE. Our approach leverages the unique capabilities of VAEs to capture complex, non-linear relationships within high-dimensional neuroimaging data. We rigorously evaluated the models using various metrics, focusing on the root mean squared error (RMSE). The results highlight the efficacy of the multi-layer neural network model, which outperformed the other techniques in terms of accuracy. This project advances the application of data mining in medical diagnostics and enhances our understanding of age-related hearing loss through innovative machine-learning frameworks.
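The second phase, regressing thresholds from latent features, is easy to sketch with scikit-learn. Feature extraction is mocked with random arrays here, and all hyperparameters are placeholders; only the overall pipeline (latent features, MLP regressor, RMSE scoring) follows the description above.

```python
# Phase-2 sketch: regress a hearing threshold from latent features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

latent = np.random.randn(200, 64)        # stand-in for VAE latent codes
pt500 = np.random.uniform(0, 80, 200)    # stand-in hearing thresholds (dB)

X_tr, X_te, y_tr, y_te = train_test_split(latent, pt500, random_state=0)
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, mlp.predict(X_te)) ** 0.5
print(f"PT500 RMSE: {rmse:.2f}")
```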
https://arxiv.org/abs/2405.00142
Deep learning has become the de facto method for medical image segmentation, with 3D segmentation models excelling in capturing complex 3D structures and 2D models offering high computational efficiency. However, segmenting 2.5D images, which have high in-plane but low through-plane resolution, is a relatively unexplored challenge. While applying 2D models to individual slices of a 2.5D image is feasible, it fails to capture the spatial relationships between slices. On the other hand, 3D models face challenges such as resolution inconsistencies in 2.5D images, along with computational complexity and susceptibility to overfitting when trained with limited data. In this context, 2.5D models, which capture inter-slice correlations using only 2D neural networks, emerge as a promising solution due to their reduced computational demand and simplicity in implementation. In this paper, we introduce CSA-Net, a flexible 2.5D segmentation model capable of processing 2.5D images with an arbitrary number of slices through an innovative Cross-Slice Attention (CSA) module. This module uses the cross-slice attention mechanism to effectively capture 3D spatial information by learning long-range dependencies between the center slice (for segmentation) and its neighboring slices. Moreover, CSA-Net utilizes the self-attention mechanism to understand correlations among pixels within the center slice. We evaluated CSA-Net on three 2.5D segmentation tasks: (1) multi-class brain MRI segmentation, (2) binary prostate MRI segmentation, and (3) multi-class prostate MRI segmentation. CSA-Net outperformed leading 2D and 2.5D segmentation methods across all three tasks, demonstrating its efficacy and superiority. Our code is publicly available at this https URL.
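A schematic sketch (PyTorch) of a cross-slice attention step as described: tokens from the center slice act as queries and tokens from neighboring slices as keys and values, so a 2D network gains inter-slice context. The single-layer form and shapes are our simplifications of the CSA module.

```python
# Cross-slice attention sketch: center slice queries its neighbors.
import torch
import torch.nn as nn

class CrossSliceAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, center, neighbors):
        # center: (B, N, dim) tokens of the slice to segment.
        # neighbors: (B, S*N, dim) tokens from S neighboring slices.
        out, _ = self.attn(query=center, key=neighbors, value=neighbors)
        return center + out  # residual: center slice enriched with 3D context

csa = CrossSliceAttention(dim=64)
center = torch.randn(2, 196, 64)
neighbors = torch.randn(2, 4 * 196, 64)  # e.g., 4 adjacent slices
print(csa(center, neighbors).shape)      # torch.Size([2, 196, 64])
```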
https://arxiv.org/abs/2405.00130
Contemporary 3D research, particularly in reconstruction and generation, relies heavily on 2D images for inputs or supervision. However, current designs for this 2D-3D mapping are memory-intensive, posing a significant bottleneck for existing methods and hindering new applications. In response, we propose a pair of highly scalable components for 3D neural fields: Lightplane Render and Splatter, which significantly reduce memory usage in 2D-3D mapping. These innovations enable the processing of vastly more and higher-resolution images at small memory and computational cost. We demonstrate their utility in various applications, from benefiting single-scene optimization with image-level losses to realizing a versatile pipeline for dramatically scaling 3D reconstruction and generation. Code: \url{this https URL}.
https://arxiv.org/abs/2404.19760