Diffusion-based extreme image compression methods have achieved impressive performance at extremely low bitrates. However, constrained by the iterative denoising process that starts from pure noise, these methods are limited in both fidelity and efficiency. To address these two issues, we present Relay Residual Diffusion Extreme Image Compression (RDEIC), which leverages compressed feature initialization and residual diffusion. Specifically, we first use the compressed latent features of the image with added noise, instead of pure noise, as the starting point to eliminate the unnecessary initial stages of the denoising process. Second, we design a novel relay residual diffusion that reconstructs the raw image by iteratively removing the added noise and the residual between the compressed and target latent features. Notably, our relay residual diffusion network seamlessly integrates pre-trained stable diffusion to leverage its robust generative capability for high-quality reconstruction. Third, we propose a fixed-step fine-tuning strategy to eliminate the discrepancy between the training and inference phases, further improving the reconstruction quality. Extensive experiments demonstrate that the proposed RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency. The source code will be provided in this https URL.
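A minimal sketch of the compressed-feature initialization described above, assuming a flat per-element latent and a cumulative noise-schedule value `alpha_bar_n` at the relay step; the function names (`relay_init`, `residual_target`) are hypothetical, not from the paper:

```python
import math
import random

def relay_init(z_compressed, alpha_bar_n, rng):
    """Start the reverse process from the noisy compressed latent
    instead of pure noise:
        x_n = sqrt(alpha_bar_n) * z_compressed + sqrt(1 - alpha_bar_n) * noise
    This skips the uninformative early denoising steps."""
    return [
        math.sqrt(alpha_bar_n) * z + math.sqrt(1.0 - alpha_bar_n) * rng.gauss(0.0, 1.0)
        for z in z_compressed
    ]

def residual_target(z_compressed, z_raw):
    """Residual diffusion predicts the residual between the compressed
    and target (raw) latents rather than the full signal."""
    return [zc - zr for zc, zr in zip(z_compressed, z_raw)]
```

Because the starting point already carries image content, the sampler needs fewer steps than one initialized from pure Gaussian noise.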
https://arxiv.org/abs/2410.02640
Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce IC3M, an efficient camera-rotation-based multimodal framework for monitoring both the driver and passengers in a car. IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing-modality reconstruction module. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages cross-modality relationships learned from limited labels to accurately recover missing modalities by transferring distributions from the available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severely missing modalities.
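The adaptive threshold idea can be sketched as follows, under the assumption (not stated in the abstract) that per-class thresholds scale with relative class frequency so that rare classes admit more pseudo labels; the names and the scaling rule are illustrative:

```python
def class_thresholds(class_counts, base=0.95, floor=0.6):
    """Scale a base confidence threshold per class by its relative
    frequency: rare classes get lower thresholds so more of their
    samples pass, balancing the pseudo-label distribution."""
    largest = max(class_counts.values())
    return {c: max(floor, base * (n / largest)) for c, n in class_counts.items()}

def select_pseudo_labels(probs, thresholds):
    """Keep a sample if its top-class confidence clears that class's threshold."""
    kept = []
    for i, p in enumerate(probs):
        cls = max(p, key=p.get)          # predicted class
        if p[cls] >= thresholds[cls]:
            kept.append((i, cls))
    return kept
```

With a fixed global threshold, a majority class would dominate the pseudo-label pool; the per-class floor keeps minority-class samples in training.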
https://arxiv.org/abs/2410.02592
Denoising is one of the fundamental steps of the processing pipeline that converts data captured by a camera sensor into a display-ready image or video. It is generally performed early in the pipeline, usually before demosaicking, although studies swapping their order or even conducting them jointly have been proposed. With the advent of deep learning, the quality of denoising algorithms has steadily increased. Even so, modern neural networks still have a hard time adapting to new noise levels and scenes, which is indispensable for real-world applications. With those in mind, we propose a self-similarity-based denoising scheme that weights both a pre- and a post-demosaicking denoiser for Bayer-patterned CFA video data. We show that a balance between the two leads to better image quality, and we empirically find that higher noise levels benefit from a higher influence pre-demosaicking. We also integrate temporal trajectory prefiltering steps before each denoiser, which further improve texture reconstruction. The proposed method only requires an estimation of the noise model at the sensor, accurately adapts to any noise level, and is competitive with the state of the art, making it suitable for real-world videography.
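One way to realize the pre-/post-demosaicking weighting is a noise-level-dependent convex blend; the abstract only states that higher noise should favor the pre-demosaicking denoiser, so the linear ramp and `sigma_max` below are assumptions:

```python
def blend_weight(sigma, sigma_max=50.0):
    """Weight for the pre-demosaicking denoiser: grows with the noise
    level sigma (clamped to [0, 1]). The linear ramp is a guess; the
    paper fits this balance empirically via self-similarity."""
    return min(1.0, max(0.0, sigma / sigma_max))

def fuse(pre_denoised, post_denoised, sigma):
    """Convex combination of the two denoiser outputs, pixel-wise."""
    w = blend_weight(sigma)
    return [w * a + (1.0 - w) * b for a, b in zip(pre_denoised, post_denoised)]
```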
https://arxiv.org/abs/2410.02572
This paper reviews published research in the field of computer-aided colorization technology. We argue that the colorization task originated in computer graphics, prospered with the introduction of computer vision, and is now trending toward the fusion of vision and graphics, so we put forward our taxonomy and organize the whole paper chronologically. We extend the existing reconstruction-based colorization evaluation techniques, considering that aesthetic assessment of colored images should be introduced to ensure that colorization satisfies human visual-related requirements and emotions more closely. We perform the colorization aesthetic assessment on seven representative unconditional colorization models and discuss the difference between our assessment and the existing reconstruction-based metrics. Finally, this paper identifies unresolved issues and proposes fruitful areas for future research and development. Access to the project associated with this survey can be obtained at this https URL.
https://arxiv.org/abs/2410.02288
Recent works in volume rendering, \textit{e.g.} NeRF and 3D Gaussian Splatting (3DGS), significantly advance the rendering quality and efficiency with the help of the learned implicit neural radiance field or 3D Gaussians. Rendering on top of an explicit representation, the vanilla 3DGS and its variants deliver real-time efficiency by optimizing the parametric model with single-view supervision per iteration during training, a scheme adopted from NeRF. Consequently, certain views are overfitted, leading to unsatisfactory appearance in novel-view synthesis and imprecise 3D geometry. To solve the aforementioned problems, we propose a new 3DGS optimization method embodying four key novel contributions: 1) We transform the conventional single-view training paradigm into a multi-view training strategy. With our proposed multi-view regulation, 3D Gaussian attributes are further optimized without overfitting certain training views. As a general solution, we improve the overall accuracy in a variety of scenarios and different Gaussian variants. 2) Inspired by the benefit introduced by additional views, we further propose a cross-intrinsic guidance scheme, leading to a coarse-to-fine training procedure over different resolutions. 3) Built on top of our multi-view regulated training, we further propose a cross-ray densification strategy, densifying more Gaussian kernels in the ray-intersect regions from a selection of views. 4) By further investigating the densification strategy, we found that the effect of densification should be enhanced when certain views differ dramatically. As a solution, we propose a novel multi-view augmented densification strategy, where 3D Gaussians are encouraged to get densified to a sufficient number accordingly, resulting in improved reconstruction accuracy.
https://arxiv.org/abs/2410.02103
Deep learning models have revolutionized various domains, with Multi-Layer Perceptrons (MLPs) being a cornerstone for tasks like data regression and image classification. However, a recent study has introduced Kolmogorov-Arnold Networks (KANs) as promising alternatives to MLPs, leveraging activation functions placed on edges rather than nodes. This structural shift aligns KANs closely with the Kolmogorov-Arnold representation theorem, potentially enhancing both model accuracy and interpretability. In this study, we explore the efficacy of KANs in the context of data representation via autoencoders, comparing their performance with traditional Convolutional Neural Networks (CNNs) on the MNIST, SVHN, and CIFAR-10 datasets. Our results demonstrate that KAN-based autoencoders achieve competitive performance in terms of reconstruction accuracy, thereby suggesting their viability as effective tools in data analysis tasks.
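A toy KAN-style layer illustrating the structural shift: learnable activations live on the edges rather than the nodes. Real KANs use B-spline bases, so the polynomial basis below is a deliberate simplification, and all names are hypothetical:

```python
def kan_edge_activation(x, coeffs):
    """One learnable edge function phi(x) = sum_k coeffs[k] * x**k.
    (KANs proper parameterize phi with B-splines.)"""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def kan_layer(inputs, edge_coeffs):
    """Each output j sums the per-edge activations phi_{j,i}(x_i) of
    every input i -- no node-level nonlinearity, unlike an MLP.
    edge_coeffs[j][i] holds the coefficients of edge (i -> j)."""
    return [
        sum(kan_edge_activation(x, edge_coeffs[j][i]) for i, x in enumerate(inputs))
        for j in range(len(edge_coeffs))
    ]
```

An autoencoder would stack such layers for both encoder and decoder, training the edge coefficients instead of weight matrices plus fixed activations.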
https://arxiv.org/abs/2410.02077
Monocular Depth and Surface Normals Estimation (MDSNE) is crucial for tasks such as 3D reconstruction, autonomous navigation, and underwater exploration. Current methods rely either on discriminative models, which struggle with transparent or reflective surfaces, or generative models, which, while accurate, are computationally expensive. This paper presents a novel deep learning model for MDSNE, specifically tailored for underwater environments, using a hybrid architecture that integrates Convolutional Neural Networks (CNNs) with Transformers, leveraging the strengths of both approaches. Training effective MDSNE models is often hampered by noisy real-world datasets and the limited generalization of synthetic datasets. To address this, we generate pseudo-labeled real data using multiple pre-trained MDSNE models. To ensure the quality of this data, we propose the Depth Normal Evaluation and Selection Algorithm (DNESA), which evaluates and selects the most reliable pseudo-labeled samples using domain-specific metrics. A lightweight student model is then trained on this curated dataset. Our model reduces parameters by 90% and training costs by 80%, allowing real-time 3D perception on resource-constrained devices. Key contributions include: a novel and efficient MDSNE model, the DNESA algorithm, a domain-specific data pipeline, and a focus on real-time performance and scalability. Designed for real-world underwater applications, our model facilitates low-cost deployments in underwater robots and autonomous vehicles, bridging the gap between research and practical implementation.
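DNESA itself relies on domain-specific depth/normal metrics that the abstract does not detail; as a stand-in, this sketch scores each pseudo-labeled sample by agreement across the teacher models and keeps the top fraction:

```python
def agreement_score(predictions):
    """Score one sample by inter-teacher agreement: negative mean
    pairwise absolute difference (higher = more reliable).
    predictions: list of equal-length per-teacher prediction vectors."""
    n = len(predictions)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diffs = [abs(a - b) for a, b in zip(predictions[i], predictions[j])]
            total += sum(diffs) / len(diffs)
            pairs += 1
    return -total / pairs

def select_reliable(samples, keep_ratio=0.5):
    """samples: per-sample lists of teacher predictions; return the
    indices of the most reliable fraction (the curated training set)."""
    ranked = sorted(range(len(samples)),
                    key=lambda i: agreement_score(samples[i]), reverse=True)
    k = max(1, int(len(samples) * keep_ratio))
    return sorted(ranked[:k])
```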
https://arxiv.org/abs/2410.02072
Despite the strong prediction power of deep learning models, their interpretability remains an important concern. Disentanglement models increase interpretability by decomposing the latent space into interpretable subspaces. In this paper, we propose the first disentanglement method for pathology images. We focus on the task of detecting tumor-infiltrating lymphocytes (TIL). We propose different ideas including cascading disentanglement, novel architecture, and reconstruction branches. We achieve superior performance on complex pathology images, thus improving the interpretability and even generalization power of TIL detection deep learning models. Our codes are available at this https URL.
https://arxiv.org/abs/2410.02012
Cryo-EM is an increasingly popular method for determining the atomic-resolution 3D structure of macromolecular complexes (e.g., proteins) from noisy 2D images captured by an electron microscope. The computational task is to reconstruct the 3D density of the particle, along with the 3D pose of the particle in each 2D image, for which the posterior pose distribution is highly multi-modal. Recent developments in cryo-EM have focused on deep learning, for which amortized inference has been used to predict pose. Here, we address key problems with this approach, and propose a new semi-amortized method, cryoSPIN, in which reconstruction begins with amortized inference and then switches to a form of auto-decoding to refine poses locally using stochastic gradient descent. Through evaluation on synthetic datasets, we demonstrate that cryoSPIN is able to handle multi-modal pose distributions during the amortized inference stage, while the later, more flexible stage of direct pose optimization yields faster and more accurate convergence of poses compared to baselines. On experimental data, we show that cryoSPIN outperforms the state-of-the-art cryoAI in speed and reconstruction quality.
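The semi-amortized idea, an amortized encoder proposing a pose followed by local gradient refinement, reduces in a one-parameter toy to the loop below; `refine_pose` and the quadratic loss are illustrative stand-ins for the real pose parameterization and reconstruction loss:

```python
def refine_pose(theta_init, grad_fn, lr=0.1, steps=100):
    """Auto-decoding stage: starting from the amortized prediction
    theta_init, locally refine the pose by gradient descent.
    grad_fn is assumed to return dLoss/dtheta at the current pose."""
    theta = theta_init
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta
```

The amortized stage gets the pose into the right posterior mode; direct optimization then converges quickly because it no longer has to generalize across images.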
https://arxiv.org/abs/2406.10455
Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the tradeoff. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representative capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compacted semantic information while another branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate the superior quality of image generation and shorter token length with ImageFolder tokenizer.
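Dual-branch product quantization can be sketched by splitting a feature vector and assigning each half to its own codebook, one regularized toward semantics and one left to capture pixel detail; the even split and L2 nearest-codeword assignment here are assumptions about ImageFolder's design:

```python
def quantize(vec, codebook):
    """Nearest-codeword assignment under squared L2 distance."""
    best, best_d = 0, float("inf")
    for idx, code in enumerate(codebook):
        d = sum((v - c) ** 2 for v, c in zip(vec, code))
        if d < best_d:
            best, best_d = idx, d
    return best

def dual_branch_quantize(vec, semantic_book, detail_book):
    """Product quantization with two branches: the token is the pair of
    indices, so capacity grows as |semantic_book| * |detail_book|
    without lengthening the token sequence."""
    half = len(vec) // 2
    return quantize(vec[:half], semantic_book), quantize(vec[half:], detail_book)
```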
https://arxiv.org/abs/2410.01756
With the rapidly increasing number of satellites in space and their enhanced capabilities, the amount of earth observation images collected by satellites is exceeding the transmission limits of satellite-to-ground links. Although existing learned image compression solutions achieve remarkable performance by using a sophisticated encoder to extract fruitful features for compression and a decoder to reconstruct the image, it is still hard to directly deploy those complex encoders on current satellites' embedded GPUs, with their limited computing capability and power supply, to compress images in orbit. In this paper, we propose COSMIC, a simple yet effective learned compression solution to transmit satellite images. We first design a lightweight encoder (i.e. reducing FLOPs by $2.6\sim 5\times $) on the satellite to achieve a high image compression ratio to save satellite-to-ground links. Then, for reconstruction on the ground, to deal with the feature extraction ability degradation due to the simplified encoder, we propose a diffusion-based model to compensate image details when decoding. Our insight is that satellite's earth observation photos are not just images but indeed multi-modal data with a nature of Text-to-Image pairing since they are collected with rich sensor data (e.g. coordinates, timestamp, etc.) that can be used as the condition for diffusion generation. Extensive experiments show that COSMIC outperforms state-of-the-art baselines on both perceptual and distortion metrics.
https://arxiv.org/abs/2410.01698
Neural Radiance Fields (NeRF) are widely used for novel-view synthesis and have been adapted for 3D Object Detection (3DOD), offering a promising approach to 3DOD through view-synthesis representation. However, NeRF faces inherent limitations: (i) limited representational capacity for 3DOD due to its implicit nature, and (ii) slow rendering speeds. Recently, 3D Gaussian Splatting (3DGS) has emerged as an explicit 3D representation that addresses these limitations. Inspired by these advantages, this paper introduces 3DGS into 3DOD for the first time, identifying two main challenges: (i) Ambiguous spatial distribution of Gaussian blobs: 3DGS primarily relies on 2D pixel-level supervision, resulting in unclear 3D spatial distribution of Gaussian blobs and poor differentiation between objects and background, which hinders 3DOD; (ii) Excessive background blobs: 2D images often include numerous background pixels, leading to densely reconstructed 3DGS with many noisy Gaussian blobs representing the background, negatively affecting detection. To tackle the challenge (i), we leverage the fact that 3DGS reconstruction is derived from 2D images, and propose an elegant and efficient solution by incorporating 2D Boundary Guidance to significantly enhance the spatial distribution of Gaussian blobs, resulting in clearer differentiation between objects and their background. To address the challenge (ii), we propose a Box-Focused Sampling strategy using 2D boxes to generate object probability distribution in 3D spaces, allowing effective probabilistic sampling in 3D to retain more object blobs and reduce noisy background blobs. Benefiting from our designs, our 3DGS-DET significantly outperforms the SOTA NeRF-based method, NeRF-Det, achieving improvements of +6.6 on mAP@0.25 and +8.1 on mAP@0.5 for the ScanNet dataset, and impressive +31.5 on mAP@0.25 for the ARKITScenes dataset.
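Box-Focused Sampling can be sketched as probabilistic retention of projected Gaussian centres, with a high keep-probability inside any 2D object box and a low one elsewhere; the probabilities and helper names below are illustrative, not the paper's values:

```python
import random

def in_box(uv, box):
    """Axis-aligned containment test for a projected 2D point."""
    u, v = uv
    x0, y0, x1, y1 = box
    return x0 <= u <= x1 and y0 <= v <= y1

def box_focused_sample(points_2d, boxes, p_obj=0.95, p_bg=0.1, rng=None):
    """Retain blobs that project inside an object box with probability
    p_obj and background blobs with probability p_bg, pruning the
    noisy background Gaussians that hurt detection."""
    rng = rng or random.Random(0)
    kept = []
    for i, uv in enumerate(points_2d):
        p = p_obj if any(in_box(uv, b) for b in boxes) else p_bg
        if rng.random() < p:
            kept.append(i)
    return kept
```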
https://arxiv.org/abs/2410.01647
Quantitative magnetic resonance imaging (qMRI) offers tissue-specific physical parameters with significant potential for neuroscience research and clinical practice. However, lengthy scan times for 3D multiparametric qMRI acquisition limit its clinical utility. Here, we propose SUMMIT, an innovative imaging methodology that includes data acquisition and an unsupervised reconstruction for simultaneous multiparametric qMRI. SUMMIT first encodes multiple important quantitative properties into highly undersampled k-space. It further leverages implicit neural representation incorporated with a dedicated physics model to reconstruct the desired multiparametric maps without needing external training datasets. SUMMIT delivers co-registered T1, T2, T2*, and quantitative susceptibility mapping. Extensive simulations and phantom imaging demonstrate SUMMIT's high accuracy. Additionally, the proposed unsupervised approach for qMRI reconstruction also introduces a novel zero-shot learning paradigm for multiparametric imaging applicable to various medical imaging modalities.
https://arxiv.org/abs/2410.01577
Recently, with the development of Neural Radiance Fields and Gaussian Splatting, 3D reconstruction techniques have achieved remarkably high fidelity. However, the latent representations learnt by these methods are highly entangled and lack interpretability. In this paper, we propose a novel part-aware compositional reconstruction method, called GaussianBlock, that enables semantically coherent and disentangled representations, allowing for precise and physical editing akin to building blocks, while simultaneously maintaining high fidelity. Our GaussianBlock introduces a hybrid representation that leverages the advantages of both primitives, known for their flexible actionability and editability, and 3D Gaussians, which excel in reconstruction quality. Specifically, we achieve semantically coherent primitives through a novel attention-guided centering loss derived from 2D semantic priors, complemented by a dynamic splitting and fusion strategy. Furthermore, we utilize 3D Gaussians that hybridize with primitives to refine structural details and enhance fidelity. Additionally, a binding inheritance strategy is employed to strengthen and maintain the connection between the two. Our reconstructed scenes are evidenced to be disentangled, compositional, and compact across diverse benchmarks, enabling seamless, direct and precise editing while maintaining high quality.
https://arxiv.org/abs/2410.01535
3D Gaussian splatting (3DGS) offers the capability to achieve real-time, high-quality 3D scene rendering. However, 3DGS assumes that the scene is in a clear medium and struggles to generate satisfactory representations in underwater scenes, where light absorption and scattering are prevalent and moving objects are involved. To overcome these limitations, we introduce a novel Gaussian Splatting-based method, UW-GS, designed specifically for underwater applications. It introduces a color appearance model that captures distance-dependent color variation, employs a new physics-based density control strategy to enhance clarity for distant objects, and uses a binary motion mask to handle dynamic content. Optimized with a well-designed loss function supporting scattering media and strengthened by pseudo-depth maps, UW-GS outperforms existing methods with PSNR gains up to 1.26 dB. To fully verify the effectiveness of the model, we also developed a new underwater dataset, S-UW, with dynamic object masks.
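The distance-dependent color appearance is consistent with the standard underwater image-formation model, an attenuated direct signal plus accumulated backscatter; UW-GS's exact parameterization may differ from this sketch:

```python
import math

def underwater_color(true_rgb, depth, beta, backscatter):
    """Per-channel underwater appearance at viewing distance `depth`:
        observed_c = J_c * exp(-beta_c * d) + B_c * (1 - exp(-beta_c * d))
    where J is the unattenuated color, beta the attenuation coefficient,
    and B the veiling (backscatter) light."""
    return [
        j * math.exp(-b * depth) + bs * (1.0 - math.exp(-b * depth))
        for j, b, bs in zip(true_rgb, beta, backscatter)
    ]
```

At zero distance the observed color equals the true color; as distance grows, every channel converges to the veiling light, which is why distant underwater objects lose contrast.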
https://arxiv.org/abs/2410.01517
In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features, while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results showed that models trained on EchoSet had better generalization ability than those trained on other datasets to the data collected in the physical world, which validated the practical value of the EchoSet. On EchoSet and real-world data, TIGER significantly reduces the number of parameters by 94.3% and the MACs by 95.3% while achieving performance surpassing state-of-the-art (SOTA) model TF-GridNet. This is the first speech separation model with fewer than 1 million parameters that achieves performance comparable to the SOTA model.
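The prior-knowledge band-splitting step can be sketched as slicing frequency bins into unequal bands, narrow where speech energy concentrates, and compressing each band; the widths and the mean-energy compression below are illustrative, not TIGER's actual layout:

```python
def split_bands(spectrum, band_widths):
    """Divide frequency bins into unequal contiguous bands.
    band_widths must sum to len(spectrum)."""
    assert sum(band_widths) == len(spectrum)
    bands, start = [], 0
    for w in band_widths:
        bands.append(spectrum[start:start + w])
        start += w
    return bands

def compress_band(band):
    """Toy per-band compression: collapse a band to its mean value
    (the real model learns a compressed embedding per band)."""
    return sum(band) / len(band)
```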
https://arxiv.org/abs/2410.01469
State-of-the-art computer- and robot-assisted surgery systems heavily depend on intraoperative imaging technologies such as CT and fluoroscopy to generate detailed 3D visualization of the patient's anatomy. While these imaging techniques are highly accurate, they rely on ionizing radiation and expose patients and clinicians alike. This study introduces an alternative, radiation-free approach for reconstructing the 3D spine anatomy using RGB-D data. Drawing inspiration from the 3D "mental map" that surgeons form during surgeries, we introduce SurgPointTransformer, a shape completion approach for surgical applications that can accurately reconstruct the unexposed spine regions from sparse observations of the exposed surface. Our method involves two main steps: segmentation and shape completion. The segmentation step includes spinal column localization and segmentation, followed by vertebra-wise segmentation. The segmented vertebra point clouds are then fed to SurgPointTransformer, which leverages an attention mechanism to learn patterns between visible surface features and the underlying anatomy. For evaluation, we utilize an ex-vivo dataset of nine specimens, whose CT data are used to establish the ground truth against which the outputs of our method are compared. Our method significantly outperforms the state-of-the-art baselines, achieving an average Chamfer Distance of 5.39, an F-Score of 0.85, an Earth Mover's Distance of 0.011, and a Signal-to-Noise Ratio of 22.90 dB. This study demonstrates the potential of our reconstruction method for 3D vertebral shape completion. It enables 3D reconstruction of the entire lumbar spine and surgical guidance without ionizing radiation or invasive imaging. Our work contributes to computer-aided and robot-assisted surgery, advancing the perception and intelligence of these systems.
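The Chamfer Distance reported above can be computed as follows, using one common symmetric convention (mean nearest-neighbour squared distance in both directions); the abstract does not specify which variant the paper reports:

```python
def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two point sets:
    CD(A, B) = mean_{a in A} min_{b in B} |a-b|^2
             + mean_{b in B} min_{a in A} |a-b|^2"""
    def one_way(src, dst):
        total = 0.0
        for p in src:
            total += min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst)
        return total / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)
```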
https://arxiv.org/abs/2410.01443
In recent years, there has been an increasing demand for underwater cameras that monitor the condition of offshore structures and check the number of individuals in aquaculture environments with long-period observation. One of the significant issues with this observation is that biofouling sticks densely to the aperture and lens and prevents cameras from capturing clear images. This study examines an underwater camera that combines material technologies with high inherent resistance to biofouling and computer vision technologies based on deep-learning image reconstruction for lens-less cameras. For this purpose, our prototype camera uses a coded aperture with 1k rectangular pinholes in a thin metal plate, such as copper, which hinders the growth of biofouling and keeps the surface clean. Although images taken by lens-less cameras are usually not well formed due to the lack of a traditional glass-based lens, a deep learning approach using ViT (Vision Transformer) has recently demonstrated good reconstruction of the original photo images, and our study shows that using a gated MLP (Multilayer Perceptron) also yields good results. On the other hand, bio-repellent materials require a certain thickness to exhibit their effect, while the aperture plate must remain sufficiently thinner than the pinhole size to avoid unintentional reflection and absorption on the sidewalls. Therefore, we prepared a sufficiently thin plate for image reconstruction, and we are currently testing the lens-less camera with the bio-repellent aperture in actual seawater environments to determine whether it can sufficiently suppress biofouling compared with a conventional camera that is merely waterproof.
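The coded-aperture forward model that the reconstruction network (ViT or gated MLP, per the abstract) must invert is, to a first approximation, a 2D convolution of the scene with the aperture pattern; this sketch ignores diffraction and sensor noise:

```python
def convolve2d(scene, mask):
    """Full 2D convolution: each scene point casts a shifted, scaled
    copy of the coded-aperture pattern onto the sensor, so the sensor
    reading is scene (*) mask. scene and mask are 2D lists."""
    sh, sw = len(scene), len(scene[0])
    mh, mw = len(mask), len(mask[0])
    out = [[0.0] * (sw + mw - 1) for _ in range(sh + mh - 1)]
    for i in range(sh):
        for j in range(sw):
            for a in range(mh):
                for b in range(mw):
                    out[i + a][j + b] += scene[i][j] * mask[a][b]
    return out
```

A point source (a delta scene) reproduces the mask pattern on the sensor, which is why the pinhole layout directly shapes the measurement the network learns to undo.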
https://arxiv.org/abs/2410.01365
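The gated MLP mentioned above replaces attention with a spatial gating unit: expand channels, split them in half, mix one half across spatial positions, then gate the other half with it. A sketch of one such block in the style of gMLP (the weight shapes and activation are illustrative assumptions, not the paper's actual reconstruction network):

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_mlp_block(x, w_in, w_spatial, w_out):
    """One gMLP-style block: channel expansion, spatial gating,
    channel projection. x has shape (seq_len, d_model)."""
    h = np.maximum(x @ w_in, 0.0)   # channel expansion (GELU in practice; ReLU here)
    u, v = np.split(h, 2, axis=-1)  # split channels for gating
    v = w_spatial @ v               # spatial (token-mixing) projection
    return (u * v) @ w_out          # gate elementwise, project back to d_model

seq, d, d_ff = 16, 8, 32
x = rng.normal(size=(seq, d))
out = gated_mlp_block(
    x,
    rng.normal(size=(d, d_ff)),        # channel expansion weights
    rng.normal(size=(seq, seq)),       # spatial mixing weights
    rng.normal(size=(d_ff // 2, d)),   # output projection weights
)
```

The spatial projection is what lets the block relate distant pixels, which matters for lens-less reconstruction since each pinhole spreads scene information across the whole sensor.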
High-quality eyelid reconstruction and animation are challenging due to the subtle details and complicated deformations involved. Previous works usually suffer from a trade-off between capture cost and the quality of details. In this paper, we propose a novel method that achieves detailed eyelid reconstruction and animation using only an RGB video captured by a mobile phone. Our method utilizes both static and dynamic information of the eyeballs (e.g., positions and rotations) to assist the eyelid reconstruction, together with an automatic eyeball calibration method that obtains the required eyeball parameters. Furthermore, we develop a neural eyelid control module to achieve semantic animation control of the eyelids. To the best of our knowledge, we present the first method for high-quality eyelid reconstruction and animation from lightweight captures. Extensive experiments on both synthetic and real data show that our method provides more detailed and realistic results than previous methods based on the same level of capture setup. The code is available at this https URL.
https://arxiv.org/abs/2410.01360
Neural radiance fields have recently revolutionized novel-view synthesis and achieved high-fidelity renderings. However, these methods sacrifice geometry for rendering quality, limiting their further applications, including relighting and deformation. How to synthesize photo-realistic rendering while reconstructing accurate geometry remains an unsolved problem. In this work, we present AniSDF, a novel approach that learns fused-granularity neural surfaces with physics-based encoding for high-fidelity 3D reconstruction. Different from previous neural surfaces, our fused-granularity geometry structure balances overall structures and fine geometric details, producing accurate geometry reconstruction. To disambiguate geometry from reflective appearance, we introduce blended radiance fields to model diffuse and specular appearance following the anisotropic spherical Gaussian encoding, a physics-based rendering pipeline. With these designs, AniSDF can reconstruct objects with complex structures and produce high-quality renderings. Furthermore, our method is a unified model that does not require complex hyperparameter tuning for specific objects. Extensive experiments demonstrate that our method boosts the quality of SDF-based methods by a large margin in both geometry reconstruction and novel-view synthesis.
https://arxiv.org/abs/2410.01202
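The anisotropic spherical Gaussian (ASG) encoding that AniSDF builds on represents a directional lobe with separate bandwidths along two tangent axes, which is what lets it fit stretched specular highlights. A sketch of evaluating one ASG lobe following the standard formulation (this illustrates the encoding, not the paper's implementation):

```python
import numpy as np

def anisotropic_spherical_gaussian(v, frame, bandwidths, amplitude):
    """Evaluate one ASG lobe at unit direction v (shape (3,)):
        G(v) = amplitude * max(v.z, 0) * exp(-a (v.x)^2 - b (v.y)^2)
    where frame = (x, y, z) is an orthonormal basis with z the lobe
    axis, and bandwidths = (a, b) control the anisotropy: a != b
    stretches the lobe along one tangent direction."""
    x, y, z = frame
    a, b = bandwidths
    smooth = np.maximum(v @ z, 0.0)  # clamp directions behind the lobe
    return amplitude * smooth * np.exp(-a * (v @ x) ** 2 - b * (v @ y) ** 2)

# Example: a lobe aligned with the world z-axis, tighter along y than x.
frame = (np.array([1.0, 0, 0]), np.array([0.0, 1, 0]), np.array([0.0, 0, 1]))
peak = anisotropic_spherical_gaussian(np.array([0.0, 0, 1]), frame, (2.0, 5.0), 3.0)
```

At the lobe axis the tangent dot products vanish, so the function returns exactly the amplitude; a mixture of such lobes can then model view-dependent specularity.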