In this research, we introduce MaeFuse, a novel autoencoder model designed for infrared and visible image fusion (IVIF). Existing approaches for image fusion often rely on training combined with downstream tasks to obtain high-level visual information, which is effective in emphasizing target objects and delivering impressive results in visual quality and task-specific applications. MaeFuse, however, deviates from the norm. Instead of being driven by downstream tasks, our model utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilitates omni-feature extraction for low-level reconstruction and high-level vision tasks, to obtain perception-friendly features at a low cost. To eliminate the domain gap between different modal features and the block effect caused by the MAE encoder, we further develop a guided training strategy. This strategy is meticulously crafted to ensure that the fusion layer seamlessly adjusts to the feature space of the encoder, gradually enhancing the fusion effect. It facilitates the comprehensive integration of feature vectors from both infrared and visible modalities, preserving the rich details inherent in each. MaeFuse not only introduces a novel perspective in the realm of fusion techniques but also stands out with impressive performance across various public datasets.
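As a rough illustration of the masking mechanism behind the MAE pretraining that MaeFuse builds on (a generic sketch, not the authors' code; the patch size and 75% mask ratio are illustrative assumptions):

```python
import numpy as np

def patchify(img, patch):
    """Split an HxW image into non-overlapping (patch x patch) tokens."""
    h, w = img.shape
    tokens = img.reshape(h // patch, patch, w // patch, patch)
    return tokens.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; the MAE encoder only sees these."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = rng.permutation(n)[:n_keep]
    return tokens[keep], keep

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
tokens = patchify(img, patch=8)            # 16 tokens of 64 pixels each
visible, keep = random_mask(tokens, 0.75, rng)
print(tokens.shape, visible.shape)         # (16, 64) (4, 64)
```

Because the encoder processes only the visible tokens, MAE pretraining is cheap relative to full-image autoencoding, which is what makes reusing such an encoder "low cost" here.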
https://arxiv.org/abs/2404.11016
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of the input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-task multi-modal self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotation required. We release our pre-trained models as well as source code publicly.
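A minimal numpy sketch of a symmetric multi-modal contrastive (InfoNCE-style) loss of the kind described above, where row i of each modality's embedding matrix forms the positive pair; shapes, temperature, and the symmetric formulation are illustrative assumptions, not the ConCluGen implementation:

```python
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """Symmetric contrastive loss between two modality embeddings.

    za, zb: (batch, dim) features; row i of za and zb come from the
    same video and form the positive pair, all other rows are negatives.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (batch, batch) cosine similarities
    # cross-entropy with the diagonal (matching pairs) as targets
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_ab = -np.mean(np.diag(log_prob))
    log_prob_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_ba = -np.mean(np.diag(log_prob_t))
    return 0.5 * (loss_ab + loss_ba)

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
video = audio + 0.05 * rng.standard_normal((8, 16))   # aligned modalities
print(info_nce(audio, video) < info_nce(audio, rng.standard_normal((8, 16))))
# True: aligned modalities yield a lower loss than mismatched ones
```

Minimizing this loss is what "pulls diverse data modalities of the same video together in the representation space".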
https://arxiv.org/abs/2404.10904
Recently, 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results, while allowing the rendering of high-resolution images in real time. However, leveraging 3D Gaussians for surface reconstruction poses significant challenges due to the explicit and disconnected nature of 3D Gaussians. In this work, we present Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes. Our GOF is derived from ray-tracing-based volume rendering of 3D Gaussians, enabling direct geometry extraction from 3D Gaussians by identifying the level set of the opacity field, without resorting to Poisson reconstruction or TSDF fusion as in previous work. We approximate the surface normal of Gaussians as the normal of the ray-Gaussian intersection plane, enabling the application of regularization that significantly enhances the geometry. Furthermore, we develop an efficient geometry extraction method utilizing marching tetrahedra, where the tetrahedral grids are induced from the 3D Gaussians and thus adapt to the scene's complexity. Our evaluations reveal that GOF surpasses existing 3DGS-based methods in surface reconstruction and novel view synthesis. Further, it compares favorably to, or even outperforms, neural implicit methods in both quality and speed.
https://arxiv.org/abs/2404.10772
Two-dimensional (2D) freehand ultrasonography is one of the most commonly used medical imaging modalities, particularly in obstetrics and gynaecology. However, it only captures 2D cross-sectional views of inherently 3D anatomies, losing valuable contextual information. As an alternative to requiring costly and complex 3D ultrasound scanners, 3D volumes can be constructed from 2D scans using machine learning. However, this usually requires long computation times. Here, we propose RapidVol: a neural representation framework to speed up slice-to-volume ultrasound reconstruction. We use tensor-rank decomposition to decompose the typical 3D volume into sets of tri-planes, and store those instead, along with a small neural network. A set of 2D ultrasound scans, with their ground-truth (or estimated) 3D positions and orientations (poses), is all that is required to form a complete 3D reconstruction. Reconstructions are formed from real fetal brain scans and then evaluated by requesting novel cross-sectional views. Compared to prior approaches based on fully implicit representations (e.g., neural radiance fields), our method is over 3x quicker, 46% more accurate, and more robust to inaccurate poses. Further speed-up is also possible by reconstructing from a structural prior rather than from scratch.
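A toy sketch of the tri-plane idea: replacing a dense volume with three 2D planes whose entries are combined at lookup. The additive combination and the sizes below are assumptions for illustration; RapidVol's actual factorization and its small neural decoder differ:

```python
import numpy as np

class TriPlane:
    """Tri-plane stand-in for a dense (n, n, n) volume: the value at
    integer coordinates (x, y, z) is the sum of three 2D plane entries."""
    def __init__(self, n, rng):
        self.xy = rng.standard_normal((n, n))
        self.xz = rng.standard_normal((n, n))
        self.yz = rng.standard_normal((n, n))

    def sample(self, x, y, z):
        return self.xy[x, y] + self.xz[x, z] + self.yz[y, z]

    def n_params(self):
        return self.xy.size + self.xz.size + self.yz.size

n = 128
tp = TriPlane(n, np.random.default_rng(0))
dense_params = n ** 3
print(tp.n_params(), dense_params)   # 49152 vs 2097152: ~43x fewer parameters
```

The storage drops from cubic to quadratic in the resolution, which is where the reported speed and memory gains over fully implicit representations come from.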
https://arxiv.org/abs/2404.10766
Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability still poses a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task, as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore enable not only high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of the score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.
https://arxiv.org/abs/2404.10765
Anomaly detection (AD) is often focused on detecting anomalous areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks, such as object detection and semantic segmentation. To fill these gaps, this work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field. This enables fair evaluation and sustainable development for different methods on this challenging benchmark. Moreover, current metrics such as AU-ROC have nearly reached saturation on simple datasets, which prevents a comprehensive evaluation of different methods. Inspired by the metrics in the segmentation field, we further propose several more practical threshold-dependent AD-specific metrics, i.e., m$F_1{}^{.2}_{.8}$, mAcc$^{.2}_{.8}$, mIoU$^{.2}_{.8}$, and mIoU-max. Motivated by GAN inversion's high-quality reconstruction capability, we propose a simple but more powerful InvAD framework to achieve high-quality feature reconstruction. Our method improves the effectiveness of reconstruction-based methods on the popular MVTec AD, VisA, and our newly proposed COCO-AD datasets under a multi-class unsupervised setting, where only a single detection model is trained to detect anomalies from different classes. Extensive ablation experiments have demonstrated the effectiveness of each component of our InvAD. Full codes and models are available at this https URL.
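A hedged numpy reading of the threshold-dependent metric definition above: the m$F_1{}^{.2}_{.8}$ score averages F1 over binarization thresholds from 0.2 to 0.8 (the 0.1 step size is an assumption; the paper may use a finer grid):

```python
import numpy as np

def f1(pred_mask, gt_mask):
    """Pixel-level F1 between a binary prediction and ground-truth mask."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    return 2 * tp / max(2 * tp + fp + fn, 1)

def mF1(anomaly_map, gt_mask, thresholds=np.arange(0.2, 0.81, 0.1)):
    """Mean F1 of binarized anomaly maps over a fixed threshold range."""
    return float(np.mean([f1(anomaly_map >= t, gt_mask) for t in thresholds]))

gt = np.zeros((8, 8), dtype=bool)
gt[2:5, 2:5] = True                   # 3x3 anomalous region
perfect = gt.astype(float)            # scores: 1 inside, 0 outside
print(mF1(perfect, gt))               # 1.0: perfect at every threshold
```

Unlike AU-ROC, a score like this penalizes anomaly maps whose absolute values are poorly calibrated, which is the saturation problem the authors target.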
https://arxiv.org/abs/2404.10760
We propose PyTorchGeoNodes, a differentiable module for reconstructing 3D objects from images using interpretable shape programs. In comparison to traditional CAD model retrieval methods, the use of shape programs for 3D reconstruction allows for reasoning about the semantic properties of reconstructed objects, easy editing, a low memory footprint, etc. However, the utilization of shape programs for 3D scene understanding has been largely neglected in past work. As our main contribution, we enable gradient-based optimization by introducing a module that translates shape programs designed in Blender, for example, into efficient PyTorch code. We also provide a method that relies on PyTorchGeoNodes and is inspired by Monte Carlo Tree Search (MCTS) to jointly optimize discrete and continuous parameters of shape programs and reconstruct 3D objects for input scenes. In our experiments, we apply our algorithm to reconstruct 3D objects in the ScanNet dataset and evaluate our results against CAD model retrieval-based reconstructions. Our experiments indicate that our reconstructions match the input scenes well while enabling semantic reasoning about the reconstructed objects.
https://arxiv.org/abs/2404.10620
Traditional sign language teaching methods face challenges such as limited feedback and diverse learning scenarios. 2D resources lack real-time feedback, while classroom teaching is constrained by a scarcity of teachers. Methods based on VR and AR have relatively primitive interaction feedback mechanisms. This study proposes an innovative teaching model that uses real-time monocular vision and mixed reality technology. First, we introduce an improved hand-posture reconstruction method to achieve sign language semantic retention and real-time feedback. Second, a ternary system evaluation algorithm is proposed for comprehensive assessment, maintaining good consistency with sign language experts. Furthermore, we use mixed reality technology to construct a scenario-based 3D sign language classroom and explore the user experience of scenario teaching. Overall, this paper presents a novel teaching method that provides an immersive learning experience, advanced posture reconstruction, and precise feedback, achieving positive results on user experience and learning effectiveness.
https://arxiv.org/abs/2404.10490
The 3D Gaussian Splatting (3D-GS) technique couples 3D Gaussian primitives with differentiable rasterization to achieve high-quality novel view synthesis results while providing advanced real-time rendering performance. However, due to a flaw in its adaptive density control strategy, 3D-GS frequently suffers from over-reconstruction in intricate scenes containing high-frequency details, leading to blurry rendered images. The underlying reason for this flaw remains under-explored. In this work, we present a comprehensive analysis of the cause of the aforementioned artifacts, namely gradient collision, which prevents large Gaussians in over-reconstructed regions from splitting. To address this issue, we propose the novel homodirectional view-space positional gradient as the criterion for densification. Our strategy efficiently identifies large Gaussians in over-reconstructed regions and recovers fine details by splitting them. We evaluate our proposed method on various challenging datasets. The experimental results indicate that our approach achieves the best rendering quality with reduced or similar memory consumption. Our method is easy to implement and can be incorporated into a wide variety of recent Gaussian Splatting-based methods. We will open-source our code upon formal publication. Our project page is available at: this https URL
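A toy numpy illustration of the gradient-collision diagnosis and the homodirectional criterion. This is an interpretation of the abstract (summing absolute per-pixel gradient components instead of signed ones), not the authors' code:

```python
import numpy as np

# View-space positional gradients contributed by six pixels to one Gaussian.
# They pull in opposing directions, as happens in an over-reconstructed region.
grads = np.array([[ 1.0,  0.0], [-1.0,  0.0],
                  [ 0.9,  0.1], [-0.9, -0.1],
                  [ 1.1,  0.0], [-1.1,  0.0]])

# Standard densification score: norm of the signed sum -- the opposing
# contributions cancel, so the Gaussian is never flagged for splitting.
summed = np.linalg.norm(grads.sum(axis=0))

# Homodirectional variant: take absolute values before summing, so
# opposing contributions accumulate instead of colliding.
homodirectional = np.linalg.norm(np.abs(grads).sum(axis=0))

print(summed, homodirectional)   # ~0 vs a large value
```

The collision makes the standard criterion blind to exactly theAussians that most need splitting, while the absolute-value accumulation makes them stand out.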
https://arxiv.org/abs/2404.10484
In this report, we present the 1st place solution for ICCV 2023 OmniObject3D Challenge: Sparse-View Reconstruction. The challenge aims to evaluate approaches for novel view synthesis and surface reconstruction using only a few posed images of each object. We utilize Pixel-NeRF as the basic model, and apply depth supervision as well as coarse-to-fine positional encoding. The experiments demonstrate the effectiveness of our approach in improving sparse-view reconstruction quality. We ranked first in the final test with a PSNR of 25.44614.
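Coarse-to-fine positional encoding can be sketched by weighting each frequency band with an annealing coefficient; the linear ramp below is an assumption in the spirit of frequency-annealing schemes, not the exact schedule used by the authors:

```python
import numpy as np

def posenc(x, n_freqs, alpha):
    """Sin/cos encoding of scalar x with a coarse-to-fine weight per band.

    alpha in [0, n_freqs] controls how many frequency bands are active;
    fractional alpha blends in the transitional band smoothly.
    """
    feats = []
    for k in range(n_freqs):
        w = np.clip(alpha - k, 0.0, 1.0)   # ramps 0 -> 1 as alpha passes band k
        feats += [w * np.sin(2.0 ** k * np.pi * x),
                  w * np.cos(2.0 ** k * np.pi * x)]
    return np.array(feats)

x = 0.37
coarse = posenc(x, n_freqs=4, alpha=1.0)   # early training: lowest band only
fine = posenc(x, n_freqs=4, alpha=4.0)     # late training: all bands active
print(np.count_nonzero(coarse), np.count_nonzero(fine))   # 2 8
```

Increasing alpha over training lets the network fit coarse geometry first, which is particularly useful in the sparse-view setting where high frequencies otherwise overfit the few available views.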
https://arxiv.org/abs/2404.10441
Human action recognition and performance assessment have been hot research topics in recent years. Recognition problems have mature solutions in the field of sign language, but past research in performance analysis has focused on competitive sports and medical training, overlooking scoring assessment, which is an important part of digitalizing sign language teaching. In this paper, we analyze existing technologies for performance assessment and adopt methods that perform well in human pose reconstruction tasks, combined with motion-rotation embedded expressions, proposing a two-stage sign language performance evaluation pipeline. Our analysis shows that choosing reconstruction tasks in the first stage can provide more expressive features, and that using smoothing methods can provide an effective reference for assessment. Experiments show that, compared to end-to-end evaluations, our method provides good score feedback mechanisms and high consistency with professional assessments.
https://arxiv.org/abs/2404.10383
Implicit neural representations have demonstrated significant promise for 3D scene reconstruction. Recent works have extended their applications to autonomous implicit reconstruction through Next Best View (NBV) based methods. However, the NBV method cannot guarantee complete scene coverage and often necessitates extensive viewpoint sampling, particularly in complex scenes. In this paper, we propose to 1) incorporate frontier-based exploration tasks for global coverage with implicit surface uncertainty-based reconstruction tasks to achieve high-quality reconstruction, and 2) introduce a method to estimate implicit surface uncertainty from color uncertainty, which reduces the time needed for view selection. Building on these two tasks, we propose an adaptive strategy for switching modes in view path planning, to reduce time while maintaining superior reconstruction quality. Our method exhibits the highest reconstruction quality among all planning methods and superior planning efficiency among methods involving reconstruction tasks. We deploy our method on a UAV, and the results show that it can plan multi-task views and reconstruct a scene with high quality.
https://arxiv.org/abs/2404.10218
Cryo-electron microscopy (cryo-EM) has emerged as a pivotal technology for determining the architecture of cells, viruses, and protein assemblies at near-atomic resolution. Particle picking, a key step in cryo-EM, traditionally demands extensive manual effort, while automated methods are sensitive to the low signal-to-noise ratio (SNR) and varied particle orientations. Furthermore, existing neural network (NN)-based approaches often require extensive labeled datasets, limiting their practicality. To overcome these obstacles, we introduce cryoMAE, a novel approach based on few-shot learning that harnesses the capabilities of Masked Autoencoders (MAE) to enable efficient selection of single particles in cryo-EM images. Contrary to conventional NN-based techniques, cryoMAE requires only a minimal set of positive particle images for training yet demonstrates high performance in particle detection. Furthermore, the implementation of a self-cross similarity loss ensures distinct features for particle and background regions, thereby enhancing the discrimination capability of cryoMAE. Experiments on large-scale cryo-EM datasets show that cryoMAE outperforms existing state-of-the-art (SOTA) methods, improving 3D reconstruction resolution by up to 22.4%.
https://arxiv.org/abs/2404.10178
Hoffmann et al. (2022) propose three methods for estimating a compute-optimal scaling law. We attempt to replicate their third estimation procedure, which involves fitting a parametric loss function to a reconstruction of data from their plots. We find that the reported estimates are inconsistent with their first two estimation methods, fail at fitting the extracted data, and report implausibly narrow confidence intervals--intervals this narrow would require over 600,000 experiments, while they likely only ran fewer than 500. In contrast, our rederivation of the scaling law using the third approach yields results that are compatible with the findings from the first two estimation procedures described by Hoffmann et al.
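The third Hoffmann et al. procedure fits the parametric loss $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ to observed losses. A hedged toy refit on synthetic data: for fixed $(\alpha, \beta)$ the model is linear in $(E, A, B)$, so a grid over the exponents plus linear least squares suffices. The "true" constants below are the central estimates Hoffmann et al. report, used here only to generate synthetic, noise-free data:

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.uniform(1e8, 1e10, size=50)   # model sizes (parameters)
D = rng.uniform(1e9, 1e11, size=50)   # training tokens
true = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
L = (true["E"] + true["A"] * N ** -true["alpha"]
     + true["B"] * D ** -true["beta"])          # synthetic losses

def fit(N, D, L, alphas, betas):
    """Grid-search (alpha, beta); solve E, A, B by linear least squares."""
    best = None
    for a in alphas:
        for b in betas:
            X = np.stack([np.ones_like(N), N ** -a, D ** -b], axis=1)
            coef, *_ = np.linalg.lstsq(X, L, rcond=None)
            err = np.sum((X @ coef - L) ** 2)
            if best is None or err < best[0]:
                best = (err, a, b, coef)
    return best

err, a, b, (E, A, B) = fit(N, D, L,
                           np.arange(0.20, 0.50, 0.02),
                           np.arange(0.20, 0.50, 0.02))
print(round(a, 2), round(b, 2), round(E, 2))    # 0.34 0.28 1.69
```

With noise-free data the grid point at the true exponents wins outright; the replication debate above concerns how sensitive this fit, and its confidence intervals, are on real reconstructed data.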
https://arxiv.org/abs/2404.10102
Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Although some recent work has shown preliminary success in editing a reconstructed NeRF with a diffusion prior, these methods still struggle to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic content from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models to real data often yields a textural shift incoherent with the image condition due to auto-encoding errors. These two problems are further reinforced by the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During our analyses, we also found that the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page: this https URL
https://arxiv.org/abs/2404.09995
When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect depend critically on accurate 3D facial performance capture. Because such methods are expensive, and given the widespread availability of 2D videos, recent methods have focused on monocular 3D face tracking. However, these methods often fall short in capturing precise facial movements due to limitations in their network architecture, training, and evaluation processes. Addressing these challenges, we propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model from one or many observations, integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally, we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos, which leads to performance gains on downstream tasks.
https://arxiv.org/abs/2404.09819
Large garages are ubiquitous yet intricate scenes in our daily lives, posing challenges characterized by monotonous colors, repetitive patterns, reflective surfaces, and transparent vehicle glass. Conventional Structure from Motion (SfM) methods for camera pose estimation and 3D reconstruction fail in these environments due to poor correspondence construction. To address these challenges, this paper introduces LetsGo, a LiDAR-assisted Gaussian splatting approach for large-scale garage modeling and rendering. We develop a handheld scanner, Polar, equipped with an IMU, LiDAR, and a fisheye camera, to facilitate accurate LiDAR and image data scanning. With this Polar device, we present a GarageWorld dataset consisting of five expansive garage scenes with diverse geometric structures and will release the dataset to the community for further research. We demonstrate that the LiDAR point cloud collected by the Polar device enhances a suite of 3D Gaussian splatting algorithms for garage scene modeling and rendering. We also propose a novel depth regularizer for 3D Gaussian splatting training, effectively eliminating floating artifacts in rendered images, and a lightweight Level of Detail (LOD) Gaussian renderer for real-time viewing on web-based devices. Additionally, we explore a hybrid representation that combines the advantages of traditional meshes in depicting simple geometry and colors (e.g., walls and the ground) with modern 3D Gaussian representations capturing complex details and high-frequency textures. This strategy achieves an optimal balance between memory performance and rendering quality. Experimental results on our dataset, along with ScanNet++ and KITTI-360, demonstrate the superiority of our method in rendering quality and resource efficiency.
https://arxiv.org/abs/2404.09748
Human beings construct their perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representations. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimization for signal reconstruction from sparse inputs. Software-wise, we employ a neural field to implicitly represent signals via neural networks, which is further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving large improvements in energy efficiency and parallelism without compromising reconstruction quality in tasks like 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
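The software-side compression pairs low-rank decomposition with structured pruning; a minimal, self-contained sketch of the low-rank half, factorizing one weight matrix with a truncated SVD (illustrative only, not the paper's pipeline):

```python
import numpy as np

def low_rank(W, rank):
    """Factor W (m x n) into U_r (m x rank) @ V_r (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

rng = np.random.default_rng(0)
# An exactly rank-8 weight matrix, so the truncation is lossless here.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
Ur, Vr = low_rank(W, rank=8)
approx = Ur @ Vr

params_full = W.size              # 4096 parameters for the dense matrix
params_lr = Ur.size + Vr.size     # 1024 parameters: 4x fewer
print(params_lr, np.allclose(approx, W))   # 1024 True
```

On real weights the spectrum only decays approximately, so the rank (and the subsequent structured pruning) trades reconstruction error against the memory and compute budget of the in-memory hardware.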
https://arxiv.org/abs/2404.09613
In this paper, we propose a method to segment and recover a static, clean background and multiple 360$^\circ$ objects from observations of scenes at different timestamps. Recent works have used neural radiance fields to model 3D scenes and improve the quality of novel view synthesis, while few studies have focused on modeling the invisible or occluded parts of the training images. These under-reconstructed parts constrain both scene editing and rendering view selection, thereby limiting their utility for synthetic data generation for downstream tasks. Our basic idea is that, by observing the same set of objects in various arrangements, parts that are invisible in one scene may become visible in others. By fusing the visible parts from each scene, occlusion-free rendering of both background and foreground objects can be achieved. We decompose the multi-scene fusion task into two main components: (1) object/background segmentation and alignment, where we leverage point cloud-based methods tailored to our novel problem formulation; and (2) radiance field fusion, where we introduce a visibility field to quantify the visible information of radiance fields, and propose visibility-aware rendering for the fusion of a series of scenes, ultimately obtaining clean background and 360$^\circ$ object renderings. Comprehensive experiments were conducted on synthetic and real datasets, and the results demonstrate the effectiveness of our method.
https://arxiv.org/abs/2404.09426
Reconstructing and editing 3D objects and scenes both play crucial roles in computer graphics and computer vision. Neural radiance fields (NeRFs) can achieve realistic reconstruction and editing results but suffer from rendering inefficiency. Gaussian splatting significantly accelerates rendering by rasterizing Gaussian ellipsoids. However, Gaussian splatting utilizes a single Spherical Harmonic (SH) function to model both texture and lighting, limiting the independent editing of these components. Recently, attempts have been made to decouple texture and lighting within the Gaussian splatting representation, but these may fail to produce plausible geometry and decomposition results on reflective scenes. Additionally, the forward shading technique they employ introduces noticeable blending artifacts during relighting, as the geometry attributes of the Gaussians are optimized under the original illumination and may not be suitable for novel lighting conditions. To address these issues, we introduce DeferredGS, a method for decoupling and editing the Gaussian splatting representation using deferred shading. To achieve successful decoupling, we model the illumination with a learnable environment map and define additional attributes such as texture parameters and normal direction on Gaussians, where the normal is distilled from a jointly trained signed distance function. More importantly, we apply deferred shading, resulting in more realistic relighting effects compared to previous methods. Both qualitative and quantitative experiments demonstrate the superior performance of DeferredGS in novel view synthesis and editing tasks.
https://arxiv.org/abs/2404.09412