Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples the result to high-resolution audio via bandwidth expansion, and upmixes it to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
https://arxiv.org/abs/2403.10493
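A note on the downmix-compatible upmixer above: one simple way to guarantee mono preservation is a mid/side construction, where the model predicts only a side channel and the stereo pair is formed as mono ± side, so the average downmix recovers the input exactly. A minimal sketch of that constraint (the side channel here is a random stand-in for a learned predictor, not MusicHiFi's actual GAN):

```python
import numpy as np

def upmix_downmix_compatible(mono: np.ndarray, side: np.ndarray):
    """Mid/side construction: for any predicted side signal, the average
    downmix (left + right) / 2 of the stereo output equals the mono input."""
    left = mono + side
    right = mono - side
    return left, right

# Toy check with a random stand-in for a GAN-predicted side channel.
mono = np.random.randn(16000)
side = 0.1 * np.random.randn(16000)
left, right = upmix_downmix_compatible(mono, side)
assert np.allclose((left + right) / 2, mono)
```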
Plug-and-Play Priors (PnP) is a well-known class of methods for solving inverse problems in computational imaging. PnP methods combine physical forward models with learned prior models specified as image denoisers. A common issue with these learned models is a performance drop when there is a distribution shift between the training and testing data. Test-time training (TTT) was recently proposed as a general strategy for improving the performance of learned models when training and testing data come from different distributions. In this paper, we propose PnP-TTT as a new method for overcoming distribution shifts in PnP. PnP-TTT uses deep equilibrium learning (DEQ) to optimize a self-supervised loss at the fixed points of PnP iterations. PnP-TTT can be applied directly to a single test sample to improve the generalization of PnP. We show through simulations that, given a sufficient number of measurements, PnP-TTT enables the use of image priors trained on natural images for image reconstruction in magnetic resonance imaging (MRI).
https://arxiv.org/abs/2403.10374
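For context, a minimal PnP fixed-point iteration of the kind PnP-TTT optimizes through; the TTT step itself (fine-tuning the denoiser on a self-supervised loss at the fixed point, with gradients obtained via DEQ) is only indicated in the docstring, and the toy forward model and identity denoiser are stand-ins:

```python
import numpy as np

def pnp_fixed_point(y, A, At, denoiser, gamma=0.5, iters=100):
    """PnP iteration x_{k+1} = D(x_k - gamma * A^T (A x_k - y)).
    PnP-TTT additionally fine-tunes `denoiser` by minimizing a
    self-supervised loss at the fixed point x*, differentiating through
    the fixed point with deep equilibrium (DEQ) implicit gradients."""
    x = At(y)  # crude initialization from the adjoint
    for _ in range(iters):
        grad = At(A(x) - y)             # data-fidelity gradient step
        x = denoiser(x - gamma * grad)  # learned prior applied as a denoiser
    return x

# Toy usage: random linear measurements and an identity "denoiser".
rng = np.random.default_rng(0)
M = rng.standard_normal((32, 64)) / 8.0
y = M @ rng.standard_normal(64)
x_hat = pnp_fixed_point(y, lambda v: M @ v, lambda v: M.T @ v, denoiser=lambda v: v)
```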
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as the face, hands, or cloth wrinkles. They are also prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and model spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation for points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud, or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera captures, together with our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
https://arxiv.org/abs/2403.10357
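A sketch of how pixel-aligned and voxel-aligned features can be queried jointly for an SDF predictor, in the spirit of ANIM's design; the single-scale lookups, tensor shapes, and function names are assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def query_point_features(pts, img_feat, vox_feat, K, H, W):
    """Concatenate a pixel-aligned feature (bilinear sample at each point's
    2D projection) with a voxel-aligned feature (trilinear sample at its 3D
    location); the result would feed an SDF MLP.
    pts: (N, 3) camera-space points, assumed inside the [-1, 1]^3 volume;
    img_feat: (1, C, H, W); vox_feat: (1, C, D, Hv, Wv); K: (3, 3)."""
    proj = (K @ pts.T).T                        # homogeneous pixel coords
    uv = proj[:, :2] / proj[:, 2:3]             # perspective divide
    uv = uv / pts.new_tensor([W, H]) * 2 - 1    # to [-1, 1] for grid_sample
    pix = F.grid_sample(img_feat, uv.view(1, 1, -1, 2),
                        align_corners=True)[0, :, 0].T       # (N, C)
    vox = F.grid_sample(vox_feat, pts.view(1, 1, 1, -1, 3),
                        align_corners=True)[0, :, 0, 0].T    # (N, C)
    return torch.cat([pix, vox], dim=-1)
```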
Accelerating dynamic MRI is essential for enhancing clinical applications, such as adaptive radiotherapy, and for improving patient comfort. Traditional deep learning (DL) approaches for accelerated dynamic MRI reconstruction typically rely on predefined or random subsampling patterns, applied uniformly across all temporal phases. This standard practice overlooks the benefits of leveraging temporal correlations and lacks the adaptability required for case-specific subsampling optimization, which could maximize reconstruction quality. Addressing this gap, we present a novel end-to-end framework for adaptive dynamic MRI subsampling and reconstruction. Our pipeline integrates a DL-based adaptive sampler, generating case-specific dynamic subsampling patterns, trained end-to-end with a state-of-the-art 2D dynamic reconstruction network, namely vSHARP, which effectively reconstructs the adaptively subsampled dynamic data into a moving image. Our method is assessed on dynamic cine cardiac MRI data, comparing its performance against vSHARP models that employ common subsampling trajectories, and against pipelines trained to optimize dataset-specific sampling schemes alongside vSHARP reconstruction. Our results indicate superior reconstruction quality, particularly at high accelerations.
https://arxiv.org/abs/2403.10346
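One plausible way to make such an adaptive sampler trainable end-to-end with the reconstructor is a straight-through top-k selection over per-phase k-space line scores; a sketch under that assumption (the paper's sampler parameterization may differ):

```python
import torch

def sample_dynamic_mask(logits, budget):
    """logits: (T, num_lines) case-specific scores, one row per temporal
    phase. The forward pass keeps the hard top-k line selection per phase;
    the backward pass uses the softmax relaxation (straight-through trick)."""
    probs = torch.softmax(logits, dim=-1)
    topk = torch.topk(probs, k=budget, dim=-1).indices
    hard = torch.zeros_like(probs).scatter_(-1, topk, 1.0)
    return hard + probs - probs.detach()
```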
Neural implicit surface representation methods have recently shown impressive 3D reconstruction results. However, existing solutions struggle to reconstruct urban outdoor scenes due to their large, unbounded, and highly detailed nature. Hence, achieving accurate reconstructions requires additional supervision data such as LiDAR, strong geometric priors, and long training times. To tackle such issues, we present SCILLA, a new hybrid implicit surface learning method to reconstruct large driving scenes from 2D images. SCILLA's hybrid architecture models two separate implicit fields: one for the volumetric density and another for the signed distance to the surface. To accurately represent urban outdoor scenarios, we introduce a novel volume-rendering strategy that relies on self-supervised probabilistic density estimation to sample points near the surface and to transition progressively from a volumetric to a surface representation. Compared to concurrent methods, our solution permits a proper and fast initialization of the signed distance field without relying on any geometric prior on the scene. By conducting extensive experiments on four outdoor driving datasets, we show that SCILLA can learn an accurate and detailed 3D surface scene representation in various urban scenarios while being two times faster to train than previous state-of-the-art solutions.
https://arxiv.org/abs/2403.10344
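The progressive volumetric-to-surface transition can be pictured as an annealed blend between a free-form density field and an SDF-derived density; a hypothetical sketch (SCILLA's actual schedule and SDF-to-density mapping may differ):

```python
import torch

def blended_density(sigma_field, sdf_field, pts, step, total_steps, beta=10.0):
    """Early in training the rendering density is purely volumetric; it is
    annealed toward a density derived from the signed distance field (a
    logistic mapping here, standing in for VolSDF-style formulations)."""
    alpha = min(1.0, step / total_steps)           # 0 = volumetric, 1 = SDF
    sigma_vol = sigma_field(pts)
    sigma_sdf = beta * torch.sigmoid(-beta * sdf_field(pts))
    return (1 - alpha) * sigma_vol + alpha * sigma_sdf
```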
In recent years, Neural Radiance Fields (NeRFs) have demonstrated significant potential in encoding highly-detailed 3D geometry and environmental appearance, positioning themselves as a promising alternative to traditional explicit representations for 3D scene reconstruction. However, the predominant reliance on RGB imaging presupposes ideal lighting conditions: a premise frequently unmet in robotic applications plagued by poor lighting or visual obstructions. This limitation overlooks the capabilities of infrared (IR) cameras, which excel in low-light detection and present a robust alternative under such adverse scenarios. To tackle these issues, we introduce Thermal-NeRF, the first method that estimates a volumetric scene representation in the form of a NeRF solely from IR imaging. By leveraging a thermal mapping and a structural thermal constraint derived from the thermal characteristics of IR imaging, our method showcases unparalleled proficiency in recovering NeRFs in visually degraded scenes where RGB-based methods fall short. We conduct extensive experiments to demonstrate that Thermal-NeRF can achieve superior quality compared to existing methods. Furthermore, we contribute a dataset for IR-based NeRF applications, paving the way for future research in IR NeRF reconstruction.
https://arxiv.org/abs/2403.10340
For elastomer-based tactile sensors, represented by visuotactile sensors, routine calibration of mechanical parameters (Young's modulus and Poisson's ratio) has been shown to be important for force reconstruction. However, the reliance on existing in-situ calibration methods for accurate force measurement limits their cost-effective and flexible application. This article proposes a new in-situ calibration scheme that relies only on comparing contact deformation. Based on detailed derivations of the normal contact and torsional contact theories, we designed a simple and low-cost calibration device, EasyCalib, and validated its effectiveness through extensive finite element analysis. We also explored the accuracy of EasyCalib in practical applications and demonstrated that accurate distributed contact force reconstruction can be realized based on the mechanical parameters obtained. EasyCalib balances low hardware cost, ease of operation, and low dependence on technical expertise, and is expected to provide the necessary accuracy guarantees for wide application of visuotactile sensors in the wild.
https://arxiv.org/abs/2403.10256
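For intuition on the kind of contact theory involved, the classical Hertz relation for a rigid sphere pressed into an elastic half-space links a measured force/indentation pair to the elastic constants. The sketch below uses that textbook relation with made-up numbers; EasyCalib's actual derivations (including the torsional case) are more involved:

```python
import numpy as np

def effective_modulus(force, radius, depth):
    """Hertzian normal contact: F = (4/3) * E_eff * sqrt(R) * d**1.5,
    solved for E_eff given a measured force/indentation pair."""
    return 3.0 * force / (4.0 * np.sqrt(radius) * depth ** 1.5)

def youngs_modulus(e_eff, poisson):
    """For a rigid indenter on an elastic body, E_eff = E / (1 - nu^2)."""
    return e_eff * (1.0 - poisson ** 2)

# Toy numbers (SI units): 0.5 N load, 2 mm indenter radius, 0.3 mm depth.
e_eff = effective_modulus(force=0.5, radius=2e-3, depth=3e-4)
print(youngs_modulus(e_eff, poisson=0.45))  # ~1.3 MPa, elastomer-like
```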
Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view inconsistency or lack of geometric fidelity. To overcome these challenges, we propose an orthogonal plane decomposition mechanism to extract 3D geometric features from the 2D input, enabling the generation of consistent multi-view images. Moreover, we further accelerate the state-of-the-art Gaussian splatting by incorporating epipolar attention to fuse images from different viewpoints. We demonstrate that FDGaussian generates images with high consistency across different views and reconstructs high-quality 3D objects, both qualitatively and quantitatively. More examples can be found at our website this https URL.
https://arxiv.org/abs/2403.10242
We propose a novel rolling shutter bundle adjustment method for neural radiance fields (NeRF), which utilizes unordered rolling shutter (RS) images to obtain the implicit 3D representation. Existing NeRF methods suffer from low-quality images and inaccurate initial camera poses due to the RS effect in the image, whereas the previous method that incorporates RS into NeRF requires strict sequential data input, limiting its widespread applicability. In contrast, our method recovers the physical formation of RS images by estimating camera poses and velocities, thereby removing the input constraint on sequential data. Moreover, we adopt a coarse-to-fine training strategy, in which the RS epipolar constraints of the pairwise frames in the scene graph are used to detect the camera poses that fall into local minima. The poses detected as outliers are corrected by interpolation with neighboring poses. The experimental results validate the effectiveness of our method over state-of-the-art works and demonstrate that the reconstruction of 3D representations is not constrained by the requirement of video sequence input.
https://arxiv.org/abs/2403.10119
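The "physical formation of RS images" amounts to giving every image row its own capture time and pose; a sketch of a constant-velocity row-wise pose model of the kind such methods estimate (a small-angle rotation update is assumed for brevity):

```python
import numpy as np

def rolling_shutter_pose(R0, t0, v, w, row, num_rows, readout_time):
    """Pose of a given image row under a constant-velocity model: the base
    pose (R0, t0) advanced by linear velocity v and angular velocity w for
    the row's capture time within the readout window."""
    t = (row / num_rows) * readout_time
    wt = w * t
    W = np.array([[0.0, -wt[2], wt[1]],       # skew-symmetric matrix of wt
                  [wt[2], 0.0, -wt[0]],
                  [-wt[1], wt[0], 0.0]])
    return R0 @ (np.eye(3) + W), t0 + v * t   # small-angle rotation update
```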
Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However, conventional DUNs aim to reconstruct all the missing information within the entire null space in each iteration, which can be challenging under highly ill-posed degradation and usually leads to unsatisfactory reconstruction. In this work, we propose a Progressive Divide-And-Conquer (PDAC) strategy, aiming to break down the subsampling process underlying the actual severe degradation and thus perform reconstruction sequentially. Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI, we present a rigorous derivation of the proposed PDAC framework, which can be further unfolded into an end-to-end trainable network. Specifically, each iterative stage in PDAC focuses on recovering a distinct moderate degradation according to the decomposition. Furthermore, as part of the PDAC iteration, such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimate of the decomposed sampling mask. Following this prediction, the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage. Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both multi-coil and single-coil settings.
https://arxiv.org/abs/2403.10064
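The decomposition idea can be illustrated with a fixed (non-learned) splitting of the missing k-space lines into a chain of progressively denser target masks, so each stage only bridges one moderate gap; in PDAC itself this decomposition is learned by the degradation predictor:

```python
import numpy as np

def decompose_mask(acquired_mask, num_stages, rng=None):
    """Split the unacquired k-space lines into `num_stages` groups and
    return a chain of intermediate target masks, so each reconstruction
    stage recovers only a moderate slice of the null space."""
    rng = rng or np.random.default_rng(0)
    missing = np.flatnonzero(np.asarray(acquired_mask) == 0)
    rng.shuffle(missing)
    targets, current = [], np.asarray(acquired_mask).copy()
    for chunk in np.array_split(missing, num_stages):
        current = current.copy()
        current[chunk] = 1           # lines this stage should recover
        targets.append(current)
    return targets                   # targets[-1] is fully sampled
```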
3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering. However, it couples the appearance and geometry of the scene within the Gaussian attributes, which hinders the flexibility of editing operations, such as texture swapping. To address this issue, we propose a novel approach, namely Texture-GS, to disentangle the appearance from the geometry by representing it as a 2D texture mapped onto the 3D surface, thereby facilitating appearance editing. Technically, the disentanglement is achieved by our proposed texture mapping module, which consists of a UV mapping MLP to learn the UV coordinates for the 3D Gaussian centers, a local Taylor expansion of the MLP to efficiently approximate the UV coordinates for the ray-Gaussian intersections, and a learnable texture to capture the fine-grained appearance. Extensive experiments on the DTU dataset demonstrate that our method not only facilitates high-fidelity appearance editing but also achieves real-time rendering on consumer-level devices, e.g. a single RTX 2080 Ti GPU.
https://arxiv.org/abs/2403.10050
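The local Taylor expansion is the efficiency trick worth spelling out: instead of querying the UV MLP at every ray-Gaussian intersection, the MLP and its Jacobian are evaluated once per Gaussian center and intersections reuse the linearization. A sketch under that reading:

```python
import torch

def uv_first_order(uv_mlp, center, xs):
    """Approximate uv(x) ~= uv(c) + J(c) @ (x - c) for intersection points
    xs (N, 3) near a Gaussian center c (3,), with one MLP and Jacobian
    evaluation per center instead of one MLP query per intersection."""
    uv_c = uv_mlp(center)                                   # (2,)
    J = torch.autograd.functional.jacobian(uv_mlp, center)  # (2, 3)
    return uv_c + (xs - center) @ J.T                       # (N, 2)
```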
Histo-genomic multi-modal methods have recently emerged as a powerful paradigm, demonstrating significant potential for improving cancer prognosis. However, genome sequencing, unlike histopathology imaging, is still not widely accessible in underdeveloped regions, limiting the application of these multi-modal approaches in clinical settings. To address this, we propose a novel Genome-informed Hyper-Attention Network, termed G-HANet, which is capable of effectively distilling the histo-genomic knowledge during training to elevate uni-modal whole slide image (WSI)-based inference for the first time. Compared with traditional knowledge distillation methods (i.e., teacher-student architecture) in other tasks, our end-to-end model is superior in terms of training efficiency and learning cross-modal interactions. Specifically, the network comprises the cross-modal associating branch (CAB) and hyper-attention survival branch (HSB). Through the genomic data reconstruction from WSIs, CAB effectively distills the associations between functional genotypes and morphological phenotypes and offers insights into the gene expression profiles in the feature space. Subsequently, HSB leverages the distilled histo-genomic associations as well as the generated morphology-based weights to achieve the hyper-attention modeling of the patients from both histopathology and genomic perspectives to improve cancer prognosis. Extensive experiments are conducted on five TCGA benchmarking datasets and the results demonstrate that G-HANet significantly outperforms the state-of-the-art WSI-based methods and achieves competitive performance with genome-based and multi-modal methods. G-HANet is expected to be explored as a useful tool by the research community to address the current bottleneck of insufficient histo-genomic data pairing in the context of cancer prognosis and precision oncology.
https://arxiv.org/abs/2403.10040
While text-to-3D and image-to-3D generation tasks have received considerable attention, one important but under-explored field between them is controllable text-to-3D generation, which we mainly focus on in this work. To address this task, 1) we introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing pre-trained multi-view diffusion models by integrating additional input conditions, such as edge, depth, normal, and scribble maps. Our innovation lies in the introduction of a conditioning module that controls the base diffusion model using both local and global embeddings, which are computed from the input condition images and camera poses. Once trained, MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation. And 2) we propose an efficient multi-stage 3D generation pipeline that leverages the benefits of recent large reconstruction models and the score distillation algorithm. Building upon our MVControl architecture, we employ a unique hybrid diffusion guidance method to direct the optimization process. In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the issue of poor geometry in 3D Gaussians and enables the direct sculpting of fine-grained geometry on the mesh. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content.
https://arxiv.org/abs/2403.09981
We have built a custom mobile multi-camera large-space dense light field capture system, which provides a series of high-quality and sufficiently dense light field images for various scenarios. Our aim is to contribute to the development of popular 3D scene reconstruction algorithms such as IBRnet, NeRF, and 3D Gaussian splatting. More importantly, the collected dataset, which is much denser than existing datasets, may also inspire space-oriented light field reconstruction, which is potentially different from object-centric 3D reconstruction, for immersive VR/AR experiences. We utilized a total of 40 GoPro 10 cameras, capturing images at 5K resolution. The number of photos captured for each scene is no less than 1000, and the average density (view number within a unit sphere) is 134.68. It is also worth noting that our system is capable of efficiently capturing large outdoor scenes. Addressing the current lack of large-space and dense light field datasets, we made efforts to include elements such as sky, reflections, lights, and shadows that are of interest to researchers in the field of 3D reconstruction during the data capture process. Finally, we validated the effectiveness of our dataset on three popular algorithms and also integrated the reconstructed 3DGS results into the Unity engine, demonstrating the potential of utilizing our datasets to enhance the realism of virtual reality (VR) and to create feasible interactive spaces. The dataset is available at our project website.
https://arxiv.org/abs/2403.09973
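A note on the density figure: read as "average number of views within a unit sphere", it can be computed from the camera centers alone; a sketch under that interpretation (the paper's exact definition may differ):

```python
import numpy as np

def average_view_density(cam_centers, radius=1.0):
    """Average, over all cameras, of the number of camera centers that lie
    within `radius` of each one (the camera itself included)."""
    c = np.asarray(cam_centers)                           # (N, 3)
    d = np.linalg.norm(c[:, None] - c[None, :], axis=-1)  # pairwise distances
    return (d <= radius).sum(axis=1).mean()
```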
Tasks such as autonomous navigation, 3D reconstruction, and object recognition near water surfaces are crucial in marine robotics applications. However, challenges arise due to dynamic disturbances, e.g., light reflections and refraction from the random air-water interface, irregular liquid flow, and similar factors, which can lead to potential failures in perception and navigation systems. Traditional computer vision algorithms struggle to differentiate between real and virtual image regions, significantly complicating these tasks. A virtual image region is an apparent representation formed by the redirection of light rays, typically through reflection or refraction, creating the illusion of an object's presence without its actual physical location. This work proposes a novel approach for segmenting real and virtual image regions, exploiting synthetic images combined with domain-invariant information, a Motion Entropy Kernel, and Epipolar Geometric Consistency. Our segmentation network does not need to be re-trained if the domain changes. We show this by deploying the same segmentation network in two different domains: simulation and the real world. By creating realistic synthetic images that mimic the complexities of the water surface, we provide fine-grained training data for our network (MARVIS) to discern between real and virtual images effectively. Through motion- and geometry-aware design choices and comprehensive experimental analysis, we achieve state-of-the-art real-virtual image segmentation performance in an unseen real-world domain, achieving an IoU over 78% and an F1-Score over 86% while ensuring a small computational footprint. MARVIS offers over 43 FPS (8 FPS) inference rates on a single GPU (CPU core). Our code and dataset are available here this https URL.
https://arxiv.org/abs/2403.09850
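As one plausible reading of the Motion Entropy Kernel, real and virtual regions can be scored by the entropy of the optical-flow statistics inside a local patch; a sketch of that idea (the paper's kernel may be defined differently):

```python
import numpy as np

def motion_entropy(flow_mag, bins=16):
    """Shannon entropy of the optical-flow magnitude histogram in a patch;
    reflected/refracted (virtual) regions tend to show different motion
    statistics than real ones."""
    hist, _ = np.histogram(flow_mag, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```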
Constructing a 3D scene capable of accommodating open-ended language queries, is a pivotal pursuit, particularly within the domain of robotics. Such technology facilitates robots in executing object manipulations based on human language directives. To tackle this challenge, some research efforts have been dedicated to the development of language-embedded implicit fields. However, implicit fields (e.g. NeRF) encounter limitations due to the necessity of processing a large number of input views for reconstruction, coupled with their inherent inefficiencies in inference. Thus, we present the GaussianGrasper, which utilizes 3D Gaussian Splatting to explicitly represent the scene as a collection of Gaussian primitives. Our approach takes a limited set of RGB-D views and employs a tile-based splatting technique to create a feature field. In particular, we propose an Efficient Feature Distillation (EFD) module that employs contrastive learning to efficiently and accurately distill language embeddings derived from foundational models. With the reconstructed geometry of the Gaussian field, our method enables the pre-trained grasping model to generate collision-free grasp pose candidates. Furthermore, we propose a normal-guided grasp module to select the best grasp pose. Through comprehensive real-world experiments, we demonstrate that GaussianGrasper enables robots to accurately query and grasp objects with language instructions, providing a new solution for language-guided manipulation tasks. Data and codes can be available at this https URL.
https://arxiv.org/abs/2403.09637
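The Efficient Feature Distillation module distills foundation-model embeddings with a contrastive objective; a minimal InfoNCE-style sketch of such a loss, pairing each rendered pixel feature with its foundation-model embedding (the module's exact loss and sampling are assumptions here):

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(rendered, teacher, temperature=0.07):
    """rendered, teacher: (N, D) per-pixel features from the Gaussian field
    and from a foundation model. Each rendered feature is pulled toward its
    own teacher embedding and pushed away from the other pixels' embeddings."""
    s = F.normalize(rendered, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature                 # (N, N) similarities
    labels = torch.arange(s.shape[0], device=s.device)
    return F.cross_entropy(logits, labels)
```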
We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. These inverse problems involve fitting a human body model to image observations, traditionally solved through optimization techniques. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. The diffusion model is trained to capture the conditional distribution of the human model parameters given an input image. By guiding its denoising process with a task-specific score, ScoreHMR effectively solves inverse problems for various applications without the need for retraining the task-agnostic diffusion model. We evaluate our approach on three settings/applications: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; and (iii) reconstructing humans in video sequences. ScoreHMR consistently outperforms all optimization baselines on popular benchmarks across all settings. We make our code and models available at this https URL.
https://arxiv.org/abs/2403.09623
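The score-guidance mechanism resembles classifier guidance: the noise prediction is biased by the gradient of a task-specific loss evaluated on the current clean-sample estimate. A sketch under the standard DDPM parameterization (the function names and guidance form are assumptions):

```python
import torch

def guided_eps(x_t, eps_model, task_loss, alpha_bar_t, scale):
    """Bias the predicted noise with the gradient of a task-specific loss
    (e.g., 2D keypoint reprojection error) taken at the denoised estimate."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t)
    # Clean-sample estimate under the DDPM noise parameterization.
    x0 = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    grad = torch.autograd.grad(task_loss(x0), x_t)[0]
    return (eps + scale * (1 - alpha_bar_t) ** 0.5 * grad).detach()
```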
Reconstructing and simulating elastic objects from visual observations is crucial for applications in computer vision and robotics. Existing methods, such as 3D Gaussians, provide modeling for 3D appearance and geometry but lack the ability to simulate physical properties or optimize parameters for heterogeneous objects. We propose Spring-Gaus, a novel framework that integrates 3D Gaussians with physics-based simulation for reconstructing and simulating elastic objects from multi-view videos. Our method utilizes a 3D Spring-Mass model, enabling the optimization of physical parameters at the individual point level while decoupling the learning of physics and appearance. This approach achieves great sample efficiency, enhances generalization, and reduces sensitivity to the distribution of simulation particles. We evaluate Spring-Gaus on both synthetic and real-world datasets, demonstrating accurate reconstruction and simulation of elastic objects. This includes future prediction and simulation under varying initial states and environmental parameters. Project page: this https URL.
https://arxiv.org/abs/2403.09434
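For intuition, the 3D Spring-Mass model reduces to Hooke's-law forces along edges between mass points; a minimal explicit-Euler step of such a system (Spring-Gaus learns per-point physical parameters and uses its own integrator, so treat this as a toy):

```python
import numpy as np

def spring_mass_step(x, v, edges, rest_len, stiffness, mass, dt, damping=0.99):
    """One explicit-Euler update of a spring-mass system.
    x, v: (N, 3) positions/velocities; edges: list of (i, j) index pairs;
    rest_len, stiffness: per-edge rest lengths and spring constants."""
    f = np.zeros_like(x)
    for (i, j), L0, k in zip(edges, rest_len, stiffness):
        d = x[j] - x[i]
        dist = np.linalg.norm(d) + 1e-9
        fij = k * (dist - L0) * d / dist   # Hooke's law along the edge
        f[i] += fij
        f[j] -= fij
    f[:, 2] -= mass * 9.81                 # gravity along -z
    v = damping * (v + dt * f / mass)
    return x + dt * v, v
```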
The task of separating dynamic objects from static environments using NeRFs has been widely studied in recent years. However, capturing large-scale scenes still poses a challenge due to their complex geometric structures and unconstrained dynamics. Without the help of 3D motion cues, previous methods often require simplified setups with slow camera motion and only a few/single dynamic actors, leading to suboptimal solutions in most urban setups. To overcome such limitations, we present RoDUS, a pipeline for decomposing static and dynamic elements in urban scenes, with thoughtfully separated NeRF models for moving and non-moving components. Our approach utilizes a robust kernel-based initialization coupled with 4D semantic information to selectively guide the learning process. This strategy enables accurate capturing of the dynamics in the scene, resulting in reduced artifacts caused by NeRF on background reconstruction, all by using self-supervision. Notably, experimental evaluations on KITTI-360 and Pandaset datasets demonstrate the effectiveness of our method in decomposing challenging urban scenes into precise static and dynamic components.
https://arxiv.org/abs/2403.09419
3D Gaussian splatting (3DGS) has recently demonstrated impressive capabilities in real-time novel view synthesis and 3D reconstruction. However, 3DGS heavily depends on the accurate initialization derived from Structure-from-Motion (SfM) methods. When trained with randomly initialized point clouds, 3DGS fails to maintain its ability to produce high-quality images, undergoing large performance drops of 4-5 dB in PSNR. Through extensive analysis of SfM initialization in the frequency domain and analysis of a 1D regression task with multiple 1D Gaussians, we propose a novel optimization strategy dubbed RAIN-GS (Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting), that successfully trains 3D Gaussians from random point clouds. We show the effectiveness of our strategy through quantitative and qualitative comparisons on multiple datasets, largely improving the performance in all settings. Our project page and code can be found at this https URL.
https://arxiv.org/abs/2403.09413