Neural Radiance Fields (NeRF) have shown impressive results in 3D reconstruction and novel view synthesis. A key challenge within NeRF is editing reconstructed scenes, such as object removal, which requires maintaining consistency across multiple views and ensuring high-quality synthesised perspectives. Previous studies have incorporated depth priors, typically from LiDAR or the sparse depth measurements provided by COLMAP, to improve object removal in NeRF. However, these methods are either costly or time-consuming. In this paper, we propose a novel approach that integrates monocular depth estimates with NeRF-based object removal models, substantially reducing processing time and enhancing the robustness and quality of scene generation and object removal. We conducted a thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset to verify its accuracy in depth map generation. Our findings suggest that COLMAP can serve as an effective alternative to ground-truth depth maps where such information is missing or costly to obtain. Additionally, we integrated various monocular depth estimation methods into a removal NeRF model, i.e., SpinNeRF, to assess their capacity to improve object removal performance. Our experimental results highlight the potential of monocular depth estimation to substantially improve NeRF applications.
https://arxiv.org/abs/2405.00630
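The depth-prior idea above hinges on the fact that monocular depth estimates are only defined up to an unknown scale and shift. A minimal sketch of how such a prior could supervise rendered depth, assuming a least-squares affine alignment before the penalty (function names are our illustration, not SpinNeRF's actual API):

```python
import numpy as np

def align_monocular_depth(mono_depth, ref_depth, mask):
    """Fit a scale s and shift t so that s * mono + t best matches ref
    in the least-squares sense (monocular depth is only affine-accurate)."""
    m, r = mono_depth[mask], ref_depth[mask]
    A = np.stack([m, np.ones_like(m)], axis=1)      # design matrix [mono, 1]
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)  # closed-form affine fit
    return s * mono_depth + t

def depth_prior_loss(rendered_depth, mono_depth, mask):
    """L1 penalty between rendered depth and the aligned monocular prior."""
    aligned = align_monocular_depth(mono_depth, rendered_depth, mask)
    return float(np.abs(rendered_depth[mask] - aligned[mask]).mean())
```

In practice this loss would be added to the photometric loss during NeRF optimization; the alignment step is what lets a cheap monocular network stand in for LiDAR or COLMAP depth.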
Large language models (LLMs), with their strong zero-shot topic extraction capabilities, offer an alternative to probabilistic topic modelling and closed-set topic classification approaches. As zero-shot topic extractors, LLMs are expected to understand human instructions to generate relevant and non-hallucinated topics based on the given documents. However, LLM-based topic modelling approaches often face difficulties in generating topics that adhere to the granularity specified in human instructions, frequently resulting in many near-duplicate topics. Furthermore, methods for addressing hallucinated topics generated by LLMs have not yet been investigated. In this paper, we focus on addressing the issues of topic granularity and hallucination for better LLM-based topic modelling. To this end, we introduce a novel approach that leverages Direct Preference Optimisation (DPO) to fine-tune open-source LLMs, such as Mistral-7B. Our approach does not rely on traditional human annotation to rank preferred answers but employs a reconstruction pipeline to modify raw topics generated by LLMs, thus enabling a fast and efficient training and inference framework. Comparative experiments show that our fine-tuning approach not only significantly improves the LLM's capability to produce more coherent, relevant, and precise topics, but also reduces the number of hallucinated topics.
https://arxiv.org/abs/2405.00611
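As a rough sketch of the per-pair DPO objective the paper builds on: the policy is rewarded for widening its log-probability margin between the preferred and dispreferred topic list relative to a frozen reference model (the β value and log-probabilities below are illustrative, not values from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid(beta * (policy margin - reference margin)),
    where 'w' is the preferred answer and 'l' the dispreferred one."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Here the "preferred" answers would come from the paper's reconstruction pipeline rather than human rankings, which is what makes the training loop annotation-free.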
In the field of trajectory generation for objects, ensuring continuous collision-free motion remains a major challenge, especially for non-convex geometries and complex environments. Previous methods either oversimplify object shapes, sacrificing feasible space, or rely on discrete sampling, which suffers from the "tunnel effect". To address these limitations, we propose a novel hierarchical trajectory generation pipeline, which utilizes the Swept Volume Signed Distance Field (SVSDF) to guide trajectory optimization for Continuous Collision Avoidance (CCA). Our interdisciplinary approach, blending techniques from graphics and robotics, exhibits outstanding effectiveness in solving this problem. We formulate the computation of the SVSDF as a Generalized Semi-Infinite Programming model and solve for numerical solutions at query points implicitly, thereby eliminating the need for explicit reconstruction of the surface. Our algorithm has been validated in a variety of complex scenarios and applies to robots of various dynamics, including both rigid and deformable shapes. It demonstrates exceptional universality and superior CCA performance compared to typical algorithms. The code will be released at this https URL for the benefit of the community.
https://arxiv.org/abs/2405.00362
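The swept-volume SDF can be understood as a minimum over time of the object's SDF evaluated in the pose at that time. A toy discretized version for a 2D disk translated along a path (the paper instead solves the continuous-time minimum implicitly via generalized semi-infinite programming; this discretization is only for intuition):

```python
import numpy as np

def disk_sdf(p, radius=1.0):
    """Signed distance to a disk of the given radius centered at the origin."""
    return float(np.linalg.norm(p)) - radius

def swept_volume_sdf(query, trajectory, radius=1.0):
    """SDF of the volume swept by the disk along a discretized trajectory:
    the minimum, over sampled poses, of the object SDF in each pose's frame."""
    return min(disk_sdf(query - c, radius) for c in trajectory)
```

A dense enough time sampling makes this converge to the true swept-volume SDF, but discretization is exactly what reintroduces the "tunnel effect" the paper avoids.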
State-of-the-art neural implicit surface representations have achieved impressive results in indoor scene reconstruction by incorporating monocular geometric priors as additional supervision. However, we have observed that multi-view inconsistency between such priors poses a challenge for high-quality reconstructions. In response, we present NC-SDF, a neural signed distance field (SDF) 3D reconstruction framework with view-dependent normal compensation (NC). Specifically, we integrate view-dependent biases in monocular normal priors into the neural implicit representation of the scene. By adaptively learning and correcting the biases, our NC-SDF effectively mitigates the adverse impact of inconsistent supervision, enhancing both the global consistency and local details in the reconstructions. To further refine the details, we introduce an informative pixel sampling strategy to pay more attention to intricate geometry with higher information content. Additionally, we design a hybrid geometry modeling approach to improve the neural implicit representation. Experiments on synthetic and real-world datasets demonstrate that NC-SDF outperforms existing approaches in terms of reconstruction quality.
https://arxiv.org/abs/2405.00340
As an important and practical way to obtain high dynamic range (HDR) video, HDR video reconstruction from sequences with alternating exposures is still less explored, mainly due to the lack of large-scale real-world datasets. Existing methods are mostly trained on synthetic datasets, which perform poorly in real scenes. In this work, to facilitate the development of real-world HDR video reconstruction, we present Real-HDRV, a large-scale real-world benchmark dataset for HDR video reconstruction, featuring various scenes, diverse motion patterns, and high-quality labels. Specifically, our dataset contains 500 LDR-HDR video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, covering daytime, nighttime, indoor, and outdoor scenes. To the best of our knowledge, our dataset is the largest real-world HDR video reconstruction dataset. Correspondingly, we propose an end-to-end network for HDR video reconstruction, where a novel two-stage strategy is designed to perform alignment sequentially. Specifically, the first stage performs global alignment with adaptively estimated global offsets, reducing the difficulty of subsequent alignment. The second stage implicitly performs local alignment in a coarse-to-fine manner at the feature level using adaptive separable convolution. Extensive experiments demonstrate that: (1) models trained on our dataset can achieve better performance on real scenes than those trained on synthetic datasets; (2) our method outperforms previous state-of-the-art methods. Our dataset is available at this https URL.
https://arxiv.org/abs/2405.00244
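The first-stage global alignment can be pictured as estimating a single frame-level offset. A toy brute-force version that searches integer shifts by mean absolute difference (the paper estimates offsets adaptively with a network rather than by exhaustive search; this is only to make the notion of a "global offset" concrete):

```python
import numpy as np

def estimate_global_offset(ref, moving, max_shift=3):
    """Brute-force the integer (dy, dx) such that moving[y+dy, x+dx]
    best matches ref[y, x], minimizing mean absolute difference over
    the overlapping region."""
    h, w = ref.shape
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # overlapping windows of ref and the shifted moving frame
            r = ref[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
            m = moving[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            err = float(np.abs(r - m).mean())
            if err < best_err:
                best, best_err = (dy, dx), err
    return best
```

Compensating this coarse offset first is what "reduces the difficulty of subsequent alignment": the second stage only has to resolve small local residual motion.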
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general audio, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.43 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs, even at significantly lower bitrates. Our code and demos are available at this https URL.
https://arxiv.org/abs/2405.00233
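The semantic half of the codec amounts to vector-quantizing AudioMAE frame embeddings against a k-means codebook, turning continuous features into discrete token indices. A minimal sketch (the toy 2-D embeddings and 2-entry codebook bear no relation to the real dimensions):

```python
import numpy as np

def kmeans_tokenize(frames, centroids):
    """Map each frame embedding (T, D) to the index of its nearest
    centroid (K, D), yielding a discrete 'semantic token' sequence."""
    # pairwise squared distances, shape (T, K)
    d2 = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```

The acoustic encoder then only has to carry the residual detail the codebook index discards, which is how the bitrate stays below 1.43 kbps.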
Artificial Intelligence holds tremendous potential in medicine, but is traditionally limited by the lack of massive datasets to train models on. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a foundation model for retinal imaging that was trained on 900,000 images, including private hospital data. Recently, the data-efficient DERETFound was proposed, which provides comparable performance while being trained on only 150,000 images that are all publicly available. However, both these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at $10,000 and $14,000, respectively, while RETFound-Green could be trained for less than $100, with equally reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster and computes vector embeddings 2.7 times faster, which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, it performs best on 14 tasks, compared to six for DERETFound and two for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.
https://arxiv.org/abs/2405.00117
Contemporary 3D research, particularly in reconstruction and generation, heavily relies on 2D images for inputs or supervision. However, current designs for these 2D-3D mappings are memory-intensive, posing a significant bottleneck for existing methods and hindering new applications. In response, we propose a pair of highly scalable components for 3D neural fields: Lightplane Render and Splatter, which significantly reduce memory usage in 2D-3D mapping. These innovations enable the processing of vastly more, higher-resolution images with small memory and computational costs. We demonstrate their utility in various applications, from benefiting single-scene optimization with image-level losses to realizing a versatile pipeline for dramatically scaling 3D reconstruction and generation. Code: \url{this https URL}.
https://arxiv.org/abs/2404.19760
We propose RTG-SLAM, a real-time 3D reconstruction system with an RGBD camera for large-scale environments using Gaussian splatting. RTG-SLAM features a compact Gaussian representation and a highly efficient on-the-fly Gaussian optimization scheme. We force each Gaussian to be either opaque or nearly transparent, with the opaque ones fitting the surface and dominant colors, and transparent ones fitting residual colors. By rendering depth in a different way from color rendering, we let a single opaque Gaussian well fit a local surface region without the need of multiple overlapping Gaussians, hence largely reducing the memory and computation cost. For on-the-fly Gaussian optimization, we explicitly add Gaussians for three types of pixels per frame: newly observed, with large color errors, and with large depth errors. We also categorize all Gaussians into stable and unstable ones, where stable Gaussians are expected to fit previously observed RGBD images well and the rest are deemed unstable. We only optimize the unstable Gaussians and only render the pixels occupied by unstable Gaussians. In this way, both the number of Gaussians to be optimized and the pixels to be rendered are largely reduced, and the optimization can be done in real time. We show real-time reconstructions of a variety of real large scenes. Compared with the state-of-the-art NeRF-based RGBD SLAM, our system achieves comparable high-quality reconstruction but with around twice the speed and half the memory cost, and shows superior performance in the realism of novel view synthesis and camera tracking accuracy.
https://arxiv.org/abs/2404.19706
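The per-frame rule for spawning new Gaussians reduces to a boolean mask over pixels. A sketch with illustrative thresholds (the paper's actual thresholds and error definitions may differ):

```python
import numpy as np

def select_pixels_for_new_gaussians(observed_before, color_err, depth_err,
                                    color_tau=0.1, depth_tau=0.05):
    """Mask of pixels that should spawn new Gaussians this frame:
    newly observed, large color error, or large depth error."""
    newly = ~observed_before
    return newly | (color_err > color_tau) | (depth_err > depth_tau)
```

Combined with the stable/unstable split, only this small, changing subset of the map is touched each frame, which is what makes real-time optimization feasible.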
We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 sparse posed images in 0.23 seconds on a single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: this https URL .
https://arxiv.org/abs/2404.19702
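The tokenization step is plain image patchification. A shape-level sketch of how V posed views become transformer tokens (the patch size and layout here are illustrative, not GS-LRM's actual configuration):

```python
import numpy as np

def patchify(images, patch=8):
    """Split V posed images (V, H, W, C) into flattened patch tokens of
    shape (V * H/patch * W/patch, patch * patch * C), the transformer input."""
    v, h, w, c = images.shape
    x = images.reshape(v, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (V, Hp, Wp, patch, patch, C)
    return x.reshape(v * (h // patch) * (w // patch), patch * patch * c)
```

After the transformer blocks, each token is decoded back to per-pixel Gaussian parameters, so the output resolution is tied directly to the input patch grid.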
AI is revolutionizing MRI along the acquisition and processing chain. Advanced AI frameworks have been developed to apply AI in various successive tasks, such as image reconstruction, quantitative parameter map estimation, and image segmentation. Existing frameworks are often designed to perform tasks independently or are focused on specific models or datasets, limiting generalization. We introduce ATOMMIC, an open-source toolbox that streamlines AI applications for accelerated MRI reconstruction and analysis. ATOMMIC implements several tasks using DL networks and enables MultiTask Learning (MTL) to perform related tasks in an integrated manner, targeting generalization in the MRI domain. We first review the current state of AI frameworks for MRI through a comprehensive literature search and by parsing 12,479 GitHub repositories. We benchmark 25 DL models on eight publicly available datasets to present distinct applications of ATOMMIC on accelerated MRI reconstruction, image segmentation, quantitative parameter map estimation, and joint accelerated MRI reconstruction and image segmentation utilizing MTL. Our findings demonstrate that ATOMMIC is the only MTL framework with harmonized complex-valued and real-valued data support. Evaluations on single tasks show that physics-based models, which enforce data consistency by leveraging the physical properties of MRI, outperform other models in reconstructing highly accelerated acquisitions. Physics-based models that produce high reconstruction quality can accurately estimate quantitative parameter maps. When high-performing reconstruction models are combined with robust segmentation networks utilizing MTL, performance is improved in both tasks. ATOMMIC facilitates MRI reconstruction and analysis by standardizing workflows, enhancing data interoperability, integrating unique features like MTL, and effectively benchmarking DL models.
https://arxiv.org/abs/2404.19665
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
https://arxiv.org/abs/2404.19654
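Test-time merging of the independently learned slot sets is an assignment problem. A brute-force stand-in for Hungarian matching, workable for small slot counts (the L2 cost and the averaging of matched pairs are our simplification of the paper's merging step):

```python
import itertools
import numpy as np

def match_slot_sets(slots_a, slots_b):
    """Find the permutation of slots_b minimizing total pairwise L2 cost
    against slots_a, then average the matched pairs."""
    k = len(slots_a)
    costs = np.linalg.norm(slots_a[:, None] - slots_b[None, :], axis=-1)
    best = min(itertools.permutations(range(k)),
               key=lambda p: sum(costs[i, p[i]] for i in range(k)))
    return (slots_a + slots_b[list(best)]) / 2.0
```

For realistic slot counts one would use the O(k^3) Hungarian algorithm instead of enumerating all k! permutations; the result is the same optimal assignment.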
Optimization-based approaches, such as score distillation sampling (SDS), show promise in zero-shot 3D generation but suffer from low efficiency, primarily due to the high number of function evaluations (NFEs) required for each sample. In this paper, we introduce score-based iterative reconstruction (SIR), an efficient and general algorithm for 3D generation with a multi-view score-based diffusion model. Given the images produced by the diffusion model, SIR reduces NFEs by repeatedly optimizing 3D parameters, unlike the single optimization in SDS, mimicking the 3D reconstruction process. With other improvements including optimization in the pixel space, we present an efficient approach called MicroDreamer that generally applies to various 3D representations and 3D generation tasks. In particular, retaining comparable performance, MicroDreamer is 5-20 times faster than SDS in generating neural radiance fields and takes about 20 seconds to generate meshes from 3D Gaussian splatting on a single A100 GPU, halving the time of the fastest zero-shot baseline, DreamGaussian. Our code is available at this https URL.
https://arxiv.org/abs/2404.19525
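The crux of SIR is that one batch of diffusion-generated views is reused across many cheap 3D-parameter updates, instead of paying diffusion NFEs on every update as SDS does. A toy version where the "renderer" is the identity map, so the inner loop is plain least-squares reconstruction against fixed targets:

```python
import numpy as np

def iterative_reconstruction(target_views, render, params, lr=0.1, steps=50):
    """Repeatedly optimize 3D parameters against fixed target images
    (zero extra diffusion NFEs inside this loop). `render` is a toy
    differentiable-renderer stand-in."""
    for _ in range(steps):
        residual = render(params) - target_views
        grad = 2.0 * residual  # gradient of the squared error (render = identity)
        params = params - lr * grad
    return params
```

In the real algorithm the targets are periodically refreshed by the multi-view diffusion model, so the NFE count scales with the number of refreshes, not the number of optimization steps.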
The task of anomaly detection is to separate anomalous data from normal data in a dataset. Models such as the deep convolutional autoencoder (CAE) network and the deep support vector data description (SVDD) model have been widely employed and have demonstrated significant success in detecting anomalies. However, the over-reconstruction ability of the CAE network for anomalous data can easily lead to a high false negative rate in detecting anomalous data. On the other hand, the deep SVDD model has the drawback of feature collapse, which leads to a decrease in detection accuracy for anomalies. To address these problems, we propose the Improved AutoEncoder with LSTM module and Kullback-Leibler divergence (IAE-LSTM-KL) model in this paper. An LSTM network is added after the encoder to memorize feature representations of normal data. Meanwhile, feature collapse is mitigated by penalizing the feature input to the SVDD module via KL divergence. The efficacy of the IAE-LSTM-KL model is validated through experiments on both synthetic and real-world datasets. Experimental results show that the IAE-LSTM-KL model yields higher detection accuracy for anomalies. In addition, the IAE-LSTM-KL model demonstrates enhanced robustness to contaminated outliers in the dataset.
https://arxiv.org/abs/2404.19247
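The KL penalty that counters feature collapse has a closed form when the features entering the SVDD module are modeled as a diagonal Gaussian (the mu/logvar parameterization below is our assumption about how such a penalty is typically implemented):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): penalizing this keeps
    the feature distribution spread out instead of collapsing to a point."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))
```

The term is zero exactly when the features already match a standard normal, and grows as they drift toward a degenerate (collapsed) distribution.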
Depth position strongly affects lens distortion, especially in close-range photography, which limits the measurement accuracy of existing stereo vision systems. Moreover, traditional depth-dependent distortion models and their calibration methods remain complicated. In this work, we propose a depth-dependent distortion model with a minimal set of parameters (MDM), which considers the radial and decentering distortions of the lens, to improve the accuracy of stereo vision systems and simplify their calibration process. In addition, we present an easy and flexible calibration method for the MDM of stereo vision systems using a commonly used planar pattern, which requires cameras to observe the planar pattern in different orientations. The proposed technique is easy to use and flexible compared with classical calibration techniques for depth-dependent distortion models, in which the lens must be perpendicular to the planar pattern. Experimental validation of the MDM and its calibration method showed that the MDM improved calibration accuracy by 56.55% and 74.15% compared with Li's distortion model and the traditional Brown distortion model, respectively. Besides, an iteration-based reconstruction method is proposed to iteratively estimate depth information in the MDM during three-dimensional reconstruction. The results showed that the accuracy of the iteration-based reconstruction method was improved by 9.08% compared with that of the non-iterative reconstruction method.
https://arxiv.org/abs/2404.19242
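For reference, the radial-plus-decentering distortion the MDM builds on follows the classic Brown form on normalized image coordinates; the MDM's contribution is making the coefficients depth-dependent, which is omitted in this sketch:

```python
def distort(x, y, k1, k2, p1, p2):
    """Brown-style lens distortion on normalized coordinates:
    radial terms (k1, k2) plus decentering/tangential terms (p1, p2)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd
```

With all coefficients zero the mapping is the identity; the iteration-based reconstruction in the paper alternates between estimating depth and re-evaluating such depth-dependent coefficients.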
The popularity of mobile vision creates a demand for advanced compact computational imaging systems, which calls for the development of both a lightweight optical system and an effective image reconstruction model. Recently, joint design pipelines have come to the research forefront, where the two significant components are simultaneously optimized via data-driven learning to realize the optimal system design. However, the effectiveness of these designs largely depends on the initial setup of the optical system, complicated by a non-convex solution space that impedes reaching a globally optimal solution. In this work, we present Global Search Optics (GSO) to automatically design compact computational imaging systems through two parts: (i) Fused Optimization Method for Automatic Optical Design (OptiFusion), which searches for diverse initial optical systems under given design specifications; and (ii) Efficient Physic-aware Joint Optimization (EPJO), which conducts parallel joint optimization of the initial optical systems and image reconstruction networks under physical constraints, culminating in the selection of the optimal solution. Extensive experimental results on the design of three-piece (3P) sphere computational imaging systems illustrate that GSO serves as a transformative end-to-end lens design paradigm with a superior ability to search for globally optimal structures, providing compact computational imaging systems with higher imaging quality than traditional methods. The source code will be made publicly available at this https URL.
https://arxiv.org/abs/2404.19201
4D time-space reconstruction of dynamic events or deforming objects using X-ray computed tomography (CT) is an extremely ill-posed inverse problem. Existing approaches assume that the object remains static for the duration of several tens or hundreds of X-ray projection measurement images (reconstruction of consecutive limited-angle CT scans). However, this is an unrealistic assumption for many in-situ experiments that causes spurious artifacts and inaccurate morphological reconstructions of the object. To solve this problem, we propose to perform a 4D time-space reconstruction using a distributed implicit neural representation (DINR) network that is trained using a novel distributed stochastic training algorithm. Our DINR network learns to reconstruct the object at its output by iterative optimization of its network parameters such that the measured projection images best match the output of the CT forward measurement model. We use a continuous time and space forward measurement model that is a function of the DINR outputs at a sparsely sampled set of continuous valued object coordinates. Unlike existing state-of-the-art neural representation architectures that forward and back propagate through dense voxel grids that sample the object's entire time-space coordinates, we only propagate through the DINR at a small subset of object coordinates in each iteration resulting in an order-of-magnitude reduction in memory and compute for training. DINR leverages distributed computation across several compute nodes and GPUs to produce high-fidelity 4D time-space reconstructions even for extremely large CT data sizes. We use both simulated parallel-beam and experimental cone-beam X-ray CT datasets to demonstrate the superior performance of our approach.
https://arxiv.org/abs/2404.19075
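The efficiency idea at the heart of this abstract — evaluating a coordinate network only at a sparse random subset of continuous (t, x, y, z) coordinates inside a continuous projection forward model, rather than over a dense voxel grid — can be illustrated with a toy sketch. This is not the authors' DINR (no distributed training, untrained random weights); the MLP architecture, ray geometry, and sample counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny coordinate MLP standing in for the DINR: maps a continuous
# (t, x, y, z) coordinate to a scalar attenuation value.
W1 = rng.normal(size=(4, 64)); b1 = np.zeros(64)
W2 = rng.normal(size=(64, 1)) / 8.0; b2 = np.zeros(1)

def dinr(coords):
    """coords: (N, 4) array of continuous (t, x, y, z) values."""
    h = np.sin(coords @ W1 + b1)          # sinusoidal first layer
    return (h @ W2 + b2).squeeze(-1)      # (N,) attenuation values

def project(ray_origins, ray_dirs, t, n_samples=32):
    """Continuous parallel-beam forward model: the line integral of the
    attenuation field along each ray at time t, approximated by a
    Riemann sum over points sampled on the ray."""
    n_rays = ray_origins.shape[0]
    s = np.linspace(0.0, 1.0, n_samples)                       # ray parameter
    pts = ray_origins[:, None, :] + s[None, :, None] * ray_dirs[:, None, :]
    coords = np.concatenate(
        [np.full((n_rays, n_samples, 1), t), pts], axis=-1
    ).reshape(-1, 4)
    atten = dinr(coords).reshape(n_rays, n_samples)
    return atten.sum(axis=1) / n_samples                       # (n_rays,)

# The memory trick from the abstract: per training iteration, propagate
# through the network only at a small random subset of rays/coordinates
# instead of the full dense time-space grid.
all_origins = rng.uniform(-1, 1, size=(10_000, 3))
all_dirs = np.tile(np.array([0.0, 0.0, 1.0]), (10_000, 1))
batch = rng.choice(10_000, size=256, replace=False)            # sparse subset
proj = project(all_origins[batch], all_dirs[batch], t=0.5)
```

In training, `proj` would be compared against the measured projections and the loss backpropagated through only these 256 × 32 coordinate evaluations, which is what decouples memory cost from the full 4D grid size.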
Compressing images at extremely low bitrates (below 0.1 bits per pixel (bpp)) is a significant challenge due to substantial information loss. Existing extreme image compression methods generally suffer from heavy compression artifacts or low-fidelity reconstructions. To address this problem, we propose a novel extreme image compression framework that combines compressive VAEs and pre-trained text-to-image diffusion models in an end-to-end manner. Specifically, we introduce a latent feature-guided compression module based on compressive VAEs. This module compresses images and initially decodes the compressed information into content variables. To enhance the alignment between content variables and the diffusion space, we introduce external guidance to modulate intermediate feature maps. Subsequently, we develop a conditional diffusion decoding module that leverages pre-trained diffusion models to further decode these content variables. To preserve the generative capability of pre-trained diffusion models, we keep their parameters fixed and use a control module to inject content information. We also design a space alignment loss to provide sufficient constraints for the latent feature-guided compression module. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in terms of both visual performance and image fidelity at extremely low bitrates.
Compressing images at extremely low bitrates (below 0.1 bits per pixel, bpp) is a significant challenge due to the substantial loss of information involved. Existing extreme image compression methods generally suffer from heavy compression artifacts or low-fidelity reconstructions. To address this problem, we propose a novel extreme image compression framework that combines compressive VAEs and pre-trained text-to-image diffusion models in an end-to-end manner. Specifically, we introduce a latent feature-guided compression module based on compressive VAEs, which compresses images and initially decodes the compressed information into content variables. To enhance the alignment between the content variables and the diffusion space, we introduce external guidance to modulate the intermediate feature maps. Subsequently, we develop a conditional diffusion decoding module that leverages pre-trained diffusion models to further decode these content variables. To preserve the generative capability of the pre-trained diffusion models, we keep their parameters fixed and use a control module to inject content information. We also design a space alignment loss to provide sufficient constraints for the latent feature-guided compression module. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both visual quality and image fidelity at extremely low bitrates.
https://arxiv.org/abs/2404.18820
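To make the "below 0.1 bpp" regime concrete, the arithmetic relating a quantised latent to bits per pixel can be sketched, together with the shape of a space-alignment penalty. The latent dimensions, bit budget, and squared-error loss form here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def bpp(latent_shape, bits_per_element, image_hw):
    """Bits per pixel of a quantised latent relative to the source image."""
    n_bits = np.prod(latent_shape) * bits_per_element
    return n_bits / np.prod(image_hw)

# A 512x512 image compressed to a 16x16x8 grid of content variables,
# entropy-coded at roughly 8 bits per element, lands at 0.0625 bpp —
# inside the extreme-compression regime the abstract targets.
rate = bpp((16, 16, 8), 8, (512, 512))

def space_alignment_loss(content_var, diffusion_latent):
    """Hypothetical alignment penalty: pull the VAE's content variables
    toward the pre-trained diffusion model's latent space."""
    return float(np.mean((content_var - diffusion_latent) ** 2))

loss = space_alignment_loss(np.zeros((2, 2)), np.ones((2, 2)))
```

The design motivation is that the frozen diffusion decoder only produces faithful reconstructions if the content variables land in a region of latent space it was trained on, hence the explicit alignment constraint.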
This paper introduces YOLOv8-TO, a novel approach for reverse engineering of topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of both optimized and random structures generated using the Moving Morphable Components method. A custom reconstruction loss function based on the dice coefficient of the predicted geometry is used to train the new regression head of the model via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs. The method showcases an average improvement of 13.84% in the Dice coefficient, with peak enhancements reaching 20.78%. The method demonstrates good generalization to complex geometries and fast inference times, making it suitable for integration into design workflows using regular workstations. Limitations include the sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.
This paper introduces YOLOv8-TO, a novel approach for reverse engineering topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of optimized and random structures generated with the Moving Morphable Components method, and a custom reconstruction loss based on the Dice coefficient of the predicted geometry trains the model's new regression head via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs, with an average improvement of 13.84% in the Dice coefficient and peak improvements reaching 20.78%. The method generalizes well to complex geometries and has fast inference times, making it suitable for integration into design workflows on regular workstations. Limitations include sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.
https://arxiv.org/abs/2404.18763
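The Dice coefficient that anchors both the reconstruction loss and the reported improvements is a standard overlap measure on binary masks; a minimal sketch follows. The exact loss head in YOLOv8-TO may differ (e.g. a differentiable soft-Dice on continuous densities), so treat this as the underlying metric only.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice overlap between two binary density masks: 2|A∩B| / (|A|+|B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def reconstruction_loss(pred, target):
    """Loss of the kind the abstract describes: penalise 1 - Dice."""
    return 1.0 - dice_coefficient(pred, target)

# Two overlapping 4x4 squares on an 8x8 grid: intersection is 3x3 = 9,
# each mask has 16 pixels, so Dice = 2*9 / (16+16) = 0.5625.
a = np.zeros((8, 8), dtype=int); a[2:6, 2:6] = 1
b = np.zeros((8, 8), dtype=int); b[3:7, 3:7] = 1
d = dice_coefficient(a, b)
```

Reporting relative Dice gains (the 13.84% average in the abstract) then compares this overlap between reconstructed geometry and the original density field across methods.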
The incorporation of generative models as regularisers within variational formulations for inverse problems has proven effective across numerous image reconstruction tasks. However, the resulting optimisation problem is often non-convex and challenging to solve. In this work, we show that score-based generative models (SGMs) can be used in a graduated optimisation framework to solve inverse problems. We show that the resulting graduated non-convexity flow converges to stationary points of the original problem and provide a numerical convergence analysis of a 2D toy example. We further provide experiments on computed tomography image reconstruction, where we show that this framework is able to recover high-quality images, independent of the initial value. The experiments highlight the potential of using SGMs in graduated optimisation frameworks.
Incorporating generative models as regularisers within variational formulations for inverse problems has proven effective across many image reconstruction tasks. However, the resulting optimisation problem is often non-convex and difficult to solve. In this work, we show that score-based generative models (SGMs) can be used in a graduated optimisation framework to solve inverse problems. We show that the resulting graduated non-convexity flow converges to stationary points of the original problem and provide a numerical convergence analysis on a 2D toy example. We also conduct experiments on computed tomography image reconstruction, showing that this framework can recover high-quality images independent of the initial value. These experiments highlight the potential of using SGMs in graduated optimisation frameworks.
https://arxiv.org/abs/2404.18699
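Graduated optimisation itself is easy to demonstrate on a 1D toy problem: minimise a heavily smoothed (and hence nearly convex) version of the objective first, then anneal the smoothing toward zero, tracking the minimiser across levels. Here Gaussian smoothing plays the role that the SGM's noise levels play in the paper; the objective, its analytic smoothed gradient, and the annealing schedule are hand-built stand-ins, not the paper's SGM-based flow.

```python
import numpy as np

def grad_smoothed(x, sigma):
    """Gradient of the Gaussian-smoothed objective
    f_sigma(x) = E_z[f(x + sigma * z)] for f(x) = 0.1 x^2 - cos(3x).
    Gaussian blur scales the oscillatory term by exp(-9 sigma^2 / 2),
    so large sigma leaves an almost-convex quadratic, mirroring how
    SGM noise levels define a family of progressively smoother problems."""
    return 0.2 * x + 3.0 * np.sin(3.0 * x) * np.exp(-4.5 * sigma**2)

x = 3.5                                    # starts near a spurious local minimum
for sigma in np.linspace(1.0, 0.0, 20):    # anneal smoothing to zero
    for _ in range(100):                   # gradient descent at this level
        x -= 0.05 * grad_smoothed(x, sigma)
# x ends near the global minimiser of f at x = 0, which plain gradient
# descent from x = 3.5 on the unsmoothed objective would not reach.
```

The convergence guarantee in the abstract concerns exactly this kind of flow: stationary points of the original (sigma = 0) problem are recovered in the limit of the annealed sequence.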