In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have developed rapidly, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, which has been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling its direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes, including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at this https URL.
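The abstract does not detail how the SfM prior enters the network; as a point of reference, a common minimal way to exploit sparse SfM geometry is to align a monocular depth map to the triangulated points with a least-squares scale and shift. All names below are illustrative, not the paper's API.

```python
import numpy as np

def align_depth_to_sfm(pred_depth, sfm_depth, sfm_mask):
    """Least-squares scale/shift alignment of a monocular depth map to sparse
    SfM depths -- a generic use of the multi-view prior, not the paper's model."""
    d = pred_depth[sfm_mask]               # predicted depth at SfM points
    z = sfm_depth[sfm_mask]                # triangulated depth at the same pixels
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, z, rcond=None)
    return scale * pred_depth + shift

# toy example: a 4x4 depth map with three sparse SfM observations
pred = np.random.rand(4, 4) + 0.5
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = mask[1, 2] = mask[3, 3] = True
sfm = np.zeros((4, 4))
sfm[mask] = 2.0 * pred[mask] + 0.1         # simulated metric depths
aligned = align_depth_to_sfm(pred, sfm, mask)
```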
https://arxiv.org/abs/2503.14483
The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements, which limit their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating the frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in training GPU memory requirements and a 20% reduction in optimization time, without sacrificing visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
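The paper's exact filtering schedule is not given in the abstract; a minimal sketch of frequency-modulated coarse-to-fine supervision, assuming a simple linearly decaying Gaussian blur applied to the training images:

```python
from PIL import Image, ImageFilter

def modulated_target(img: Image.Image, step: int, total_steps: int,
                     sigma_max: float = 8.0) -> Image.Image:
    """Low-pass the training image strongly at the start of optimization and
    progressively restore high frequencies (linear schedule is an assumption)."""
    sigma = sigma_max * max(0.0, 1.0 - step / total_steps)
    return img if sigma == 0.0 else img.filter(ImageFilter.GaussianBlur(radius=sigma))

# inside a 3DGS-style loop the blurred image would replace the ground-truth target
gt = Image.new("RGB", (64, 64), "gray")
coarse_target = modulated_target(gt, step=1_000, total_steps=30_000)
```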
https://arxiv.org/abs/2503.14475
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
https://arxiv.org/abs/2503.14463
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300.
https://arxiv.org/abs/2503.14445
User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60 FPS, last 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
https://arxiv.org/abs/2503.14359
Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video VAEs. Our models and code are available at this https URL.
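Neither the NAF module nor the exact wavelet front-end is specified in the abstract; the sketch below only illustrates the two ingredients it names, a non-overlapping patch operation and a single-level Haar decomposition (both written from scratch, so every detail is an assumption):

```python
import torch

def haar2d(x: torch.Tensor):
    """Single-level 2-D Haar transform of (B, C, H, W) frames into four sub-bands."""
    a, b = x[..., 0::2, :], x[..., 1::2, :]
    lo, hi = (a + b) / 2, (a - b) / 2
    ll, lh = (lo[..., 0::2] + lo[..., 1::2]) / 2, (lo[..., 0::2] - lo[..., 1::2]) / 2
    hl, hh = (hi[..., 0::2] + hi[..., 1::2]) / 2, (hi[..., 0::2] - hi[..., 1::2]) / 2
    return ll, lh, hl, hh

def patchify(x: torch.Tensor, p: int = 4) -> torch.Tensor:
    """Non-overlapping p x p patch tokens via pure reshaping (no sliding window)."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)

frames = torch.randn(2, 3, 64, 64)
ll, lh, hl, hh = haar2d(frames)     # each sub-band: (2, 3, 32, 32)
tokens = patchify(frames, p=4)      # (2, 256, 48)
```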
https://arxiv.org/abs/2503.14325
The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well suited for visual generation but lacking the high-level semantic representations needed for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, degrading both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high- and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
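A minimal sketch of the dual-codebook idea: high-level (semantic) and low-level (perceptual) features are quantized by separate codebooks rather than a shared one. Encoder backbones, codebook sizes, and training losses are placeholders; only the disentanglement structure follows the abstract.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal nearest-neighbour VQ layer (straight-through gradients omitted)."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (B, N, dim)
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight) # (B*N, num_codes)
        idx = dists.argmin(dim=-1).reshape(z.shape[:-1])
        return self.codebook(idx), idx                  # quantized features, token ids

class DualCodebookTokenizer(nn.Module):
    def __init__(self, dim=256, n_semantic=1024, n_pixel=2048):
        super().__init__()
        self.vq_semantic = VectorQuantizer(n_semantic, dim)  # for understanding
        self.vq_pixel = VectorQuantizer(n_pixel, dim)        # for generation

    def forward(self, feat_high, feat_low):
        return self.vq_semantic(feat_high), self.vq_pixel(feat_low)

tok = DualCodebookTokenizer()
(q_sem, ids_sem), (q_pix, ids_pix) = tok(torch.randn(1, 196, 256), torch.randn(1, 196, 256))
```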
https://arxiv.org/abs/2503.14324
3D Gaussian Splatting (3DGS) has become one of the most influential works of the past year. Due to its efficient, high-quality novel view synthesis capabilities, it has been widely adopted across many research fields and applications. Nevertheless, 3DGS still faces challenges in properly managing the number of Gaussian primitives used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians are created in under-reconstructed regions, while Gaussians that do not contribute to the rendering quality are pruned. We observe that these criteria for densifying and pruning Gaussians can sometimes worsen rendering by introducing artifacts, in particular under-reconstructed backgrounds or overfitted foreground regions. To address both problems, we propose three improvements to the adaptive density control mechanism: a correction to the scene-extent calculation that does not rely solely on camera positions, an exponentially ascending gradient threshold to improve training convergence, and a significance-aware pruning strategy to avoid background artifacts. With these adaptations, we show that rendering quality improves while using the same number of Gaussian primitives. Furthermore, with our improvements, training converges considerably faster, allowing for training times more than twice as fast as 3DGS while yielding better quality. Finally, our contributions are easily compatible with most existing derivative works of 3DGS, making them relevant for future work.
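Of the three changes, the exponentially ascending gradient threshold is the easiest to make concrete. A sketch of such a schedule for the ADC densification criterion, with start/end values chosen purely for illustration:

```python
def densify_grad_threshold(step: int, total_steps: int,
                           tau_start: float = 2e-4, tau_end: float = 2e-3) -> float:
    """Exponentially (geometrically) ascending densification threshold:
    low early on, so under-reconstructed regions are densified aggressively,
    and high later, so few new Gaussians are spawned and training converges."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** t

print(densify_grad_threshold(0, 30_000))        # 2e-4 at the start
print(densify_grad_threshold(30_000, 30_000))   # 2e-3 at the end
```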
https://arxiv.org/abs/2503.14274
Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.
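A sketch of how segmentation masks (e.g. from Grounded SAM) can enter a NeRF-style loss: transient pixels are excluded from the photometric term and sky pixels are pushed toward large depth. The weighting and the sky term are assumptions, not the paper's exact regularizers.

```python
import torch

def masked_photometric_loss(pred_rgb, gt_rgb, transient_mask, sky_mask,
                            pred_depth=None, sky_weight=0.01):
    """Photometric loss over static pixels only, plus a simple sky-depth prior."""
    static = ~transient_mask                                # drop transient objects
    loss = ((pred_rgb - gt_rgb) ** 2)[static].mean()
    if pred_depth is not None:
        # penalize small rendered depth (large disparity) where the mask says "sky"
        loss = loss + sky_weight * (1.0 / (pred_depth[sky_mask] + 1e-3)).mean()
    return loss

pred = torch.rand(128, 128, 3); gt = torch.rand(128, 128, 3)
transient = torch.zeros(128, 128, dtype=torch.bool); transient[40:60, 40:60] = True
sky = torch.zeros(128, 128, dtype=torch.bool); sky[:20] = True
print(masked_photometric_loss(pred, gt, transient, sky, pred_depth=torch.rand(128, 128) * 50))
```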
https://arxiv.org/abs/2503.14219
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods, which typically struggle with sparse views that have little overlap and are less effective at reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on these points. To account for possible misalignment between the SMPL model and the images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level and voxel-level features, from which we regress coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at this https URL.
https://arxiv.org/abs/2503.14198
We introduce an image upscaling technique tailored for 3D Gaussian Splatting (3DGS) on lightweight GPUs. Compared to 3DGS, it achieves significantly higher rendering speeds and reduces artifacts commonly observed in 3DGS reconstructions. Our technique upscales low-resolution 3DGS renderings with a marginal increase in cost by directly leveraging the analytical image gradients of Gaussians for gradient-based bicubic spline interpolation. The technique is agnostic to the specific 3DGS implementation, achieving novel view synthesis at rates 3x-4x higher than the baseline implementation. Through extensive experiments on multiple datasets, we showcase the performance improvements and high reconstruction fidelity attainable with gradient-aware upscaling of 3DGS images. We further demonstrate the integration of gradient-aware upscaling into the gradient-based optimization of a 3DGS model and analyze its effects on reconstruction quality and performance.
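The core idea, interpolating with both sample values and their analytical gradients, is easiest to see in 1-D. A cubic Hermite sketch follows; the paper works on 2-D renderings with bicubic splines, so this only shows the underlying principle.

```python
import numpy as np

def hermite_upscale_1d(values: np.ndarray, derivs: np.ndarray, factor: int = 4) -> np.ndarray:
    """Cubic Hermite interpolation using sample values and their derivatives
    (derivatives must already be scaled by the sample spacing)."""
    xs = np.linspace(0, len(values) - 1, (len(values) - 1) * factor + 1)
    i = np.clip(xs.astype(int), 0, len(values) - 2)
    t = xs - i
    h00, h10 = 2*t**3 - 3*t**2 + 1, t**3 - 2*t**2 + t
    h01, h11 = -2*t**3 + 3*t**2, t**3 - t**2
    return h00*values[i] + h10*derivs[i] + h01*values[i+1] + h11*derivs[i+1]

x = np.linspace(0, 2 * np.pi, 16)
fine = hermite_upscale_1d(np.sin(x), np.cos(x) * (x[1] - x[0]), factor=4)
```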
https://arxiv.org/abs/2503.14171
Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientations produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability of VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. The results reveal pitfalls in continuous space perception for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing the ability of continuous space perception is essential for VLMs to perform effectively in real-world tasks, and we encourage further research to advance this capability.
https://arxiv.org/abs/2503.14161
In recent years, multi-view multi-label learning (MVML) has gained popularity due to its close resemblance to real-world scenarios. However, the challenge of selecting informative features to ensure both performance and efficiency remains a significant question in MVML. Existing methods often extract information separately from the consistency part and the complementary part, which may introduce noise due to unclear segmentation between the two. In this paper, we propose a unified model constructed from the perspective of global-view reconstruction. Additionally, while feature selection methods can discern the importance of features, they typically overlook the uncertainty of samples, which is prevalent in realistic scenarios. To address this, we incorporate the perception of sample uncertainty during the reconstruction process to enhance trustworthiness. Thus, the global view is reconstructed through the graph structure between samples, sample confidence, and the view relationship, and an accurate mapping is established between the reconstructed view and the label matrix. Experimental results demonstrate the superior performance of our method on multi-view datasets.
https://arxiv.org/abs/2503.14024
Differentiable rendering enables efficient optimization by allowing gradients to be computed through the rendering process, facilitating 3D reconstruction, inverse rendering and neural scene representation learning. To ensure differentiability, existing solutions approximate or re-formulate traditional rendering operations using smooth, probabilistic proxies such as volumes or Gaussian primitives. Consequently, they struggle to preserve sharp edges due to the lack of explicit boundary definitions. We present a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), that combines Bézier triangle-based vector graphics primitives with Gaussian-based probabilistic models, to maintain accurate shape modeling while conducting resolution-independent differentiable rendering. We present a robust and effective discontinuity-aware rendering technique to reduce uncertainties at object boundaries. We also employ an adaptive densification and pruning scheme for efficient training while reliably handling level-of-detail (LoD) variations. Experiments show that BG-Triangle achieves comparable rendering quality as 3DGS but with superior boundary preservation. More importantly, BG-Triangle uses a much smaller number of primitives than its alternatives, showcasing the benefits of vectorized graphics primitives and the potential to bridge the gap between classic and emerging representations.
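For concreteness, this is what evaluating a (quadratic) Bézier triangle at barycentric coordinates looks like; the actual primitive pairs such patches with Gaussian-based probabilistic rendering, which is not reproduced here.

```python
import numpy as np

def quadratic_bezier_triangle(ctrl, u, v):
    """Evaluate a quadratic Bézier triangle at barycentric (u, v, w = 1-u-v).
    `ctrl` maps multi-indices (i, j, k) with i+j+k = 2 to 3-D control points."""
    w = 1.0 - u - v
    return (u*u*ctrl[(2, 0, 0)] + v*v*ctrl[(0, 2, 0)] + w*w*ctrl[(0, 0, 2)]
            + 2*u*v*ctrl[(1, 1, 0)] + 2*v*w*ctrl[(0, 1, 1)] + 2*u*w*ctrl[(1, 0, 1)])

ctrl = {
    (2, 0, 0): np.array([0.0, 0.0, 0.0]), (0, 2, 0): np.array([1.0, 0.0, 0.0]),
    (0, 0, 2): np.array([0.0, 1.0, 0.0]), (1, 1, 0): np.array([0.5, 0.0, 0.2]),
    (0, 1, 1): np.array([0.5, 0.5, 0.2]), (1, 0, 1): np.array([0.0, 0.5, 0.2]),
}
print(quadratic_bezier_triangle(ctrl, 1/3, 1/3))   # a point on the curved patch
```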
https://arxiv.org/abs/2503.13961
Accurate estimation of total leaf area (TLA) is crucial for evaluating plant growth, photosynthetic activity, and transpiration. However, it remains challenging for bushy plants like dwarf tomatoes due to their complex canopies. Traditional methods are often labor-intensive, damaging to plants, or limited in capturing canopy complexity. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars: Mohamed, Hahms Gelbe Topftomate, and Red Robin -- grown under controlled greenhouse conditions. Two experiments (spring-summer and autumn-winter) included 73 plants, yielding 418 TLA measurements via an "onion" approach. High-resolution videos were recorded, and 500 frames per plant were used for 3D reconstruction. Point clouds were processed using four algorithms (Alpha Shape, Marching Cubes, Poisson's, Ball Pivoting), and meshes were evaluated with seven regression models: Multivariable Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, Random Forest, Extreme Gradient Boosting, and Multilayer Perceptron. The Alpha Shape reconstruction ($\alpha = 3$) with Extreme Gradient Boosting achieved the best performance ($R^2 = 0.80$, $MAE = 489 cm^2$). Cross-experiment validation showed robust results ($R^2 = 0.56$, $MAE = 579 cm^2$). Feature importance analysis identified height, width, and surface area as key predictors. This scalable, automated TLA estimation method is suited for urban farming and precision agriculture, offering applications in automated pruning, resource efficiency, and sustainable food production. The approach demonstrated robustness across variable environmental conditions and canopy structures.
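The best-performing combination (alpha-shape meshing plus gradient-boosted regression) can be approximated with off-the-shelf tools. A sketch using Open3D and XGBoost with made-up point clouds and labels; the paper's actual feature set is richer than the three predictors used here.

```python
import numpy as np
import open3d as o3d
from xgboost import XGBRegressor

def mesh_features(points: np.ndarray, alpha: float = 3.0) -> np.ndarray:
    """Alpha-shape mesh from a plant point cloud, reduced to height, width
    and surface area -- the predictors highlighted by the feature analysis."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_alpha_shape(pcd, alpha)
    extent = pcd.get_axis_aligned_bounding_box().get_extent()
    return np.array([extent[2], max(extent[0], extent[1]), mesh.get_surface_area()])

# hypothetical data: one feature row per plant, TLA labels in cm^2
X = np.stack([mesh_features(np.random.rand(2000, 3) * 10) for _ in range(8)])
y = np.random.uniform(500, 4000, size=8)
model = XGBRegressor(n_estimators=200).fit(X, y)
print(model.predict(X[:1]))
```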
https://arxiv.org/abs/2503.13778
Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: this https URL
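FedVSR's lightweight loss term is not described in enough detail in the abstract to reproduce; for orientation, the plain FedAvg aggregation it builds on looks like this (standard federated learning, not the paper's contribution):

```python
import copy
import torch

def fedavg(client_states, client_weights):
    """Weighted average of client state_dicts (weights = local sample counts)."""
    total = sum(client_weights)
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(w * s[key] for s, w in zip(client_states, client_weights)) / total
    return avg

# toy round: three clients sharing a tiny super-resolution head
clients = [torch.nn.Conv2d(3, 3, 3, padding=1).state_dict() for _ in range(3)]
global_state = fedavg(clients, client_weights=[120, 80, 200])
```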
https://arxiv.org/abs/2503.13745
In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, \textbf{C2D-ISR}, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete-scale models, enabling the learning of inter-scale correlations and multi-scale feature representations. In addition, we generalize the hierarchical encoding mechanism to existing attention-based network structures, which achieves improved spatial feature fusion, cross-scale information aggregation, and, more importantly, much faster inference. We have evaluated the C2D-ISR framework on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the existing optimization framework HiT in terms of super-resolution performance (up to 0.2 dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at this http URL.
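The data-side half of continuous-scale training can be sketched as drawing a random real-valued scale per batch, so a model built for discrete factors also sees intermediate ones; how C2D-ISR couples this with its two-stage schedule and hierarchical encoding is not shown here.

```python
import random
import torch
import torch.nn.functional as F

def continuous_scale_pair(hr: torch.Tensor, s_min: float = 1.5, s_max: float = 4.0):
    """Create an (LR, HR, scale) training triple at a continuously sampled scale."""
    s = random.uniform(s_min, s_max)
    lr = F.interpolate(hr, scale_factor=1.0 / s, mode="bicubic", align_corners=False)
    return lr, hr, s          # the network reconstructs hr from lr at scale s

lr, hr, s = continuous_scale_pair(torch.rand(1, 3, 128, 128))
```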
https://arxiv.org/abs/2503.13740
Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.
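A sketch of the mask-weighted cross-attention idea: visibility weights bias the attention logits so occluded tokens contribute less. Head count, the log-weighting, and tensor shapes are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mask_weighted_cross_attention(q, k, v, vis, num_heads=4):
    """q: (B, Nq, D), k/v: (B, Nk, D), vis: (B, Nk) visibility weights in [0, 1]."""
    B, Nq, D = q.shape
    h, d = num_heads, D // num_heads
    qh = q.view(B, Nq, h, d).transpose(1, 2)            # (B, h, Nq, d)
    kh = k.view(B, -1, h, d).transpose(1, 2)
    vh = v.view(B, -1, h, d).transpose(1, 2)
    logits = qh @ kh.transpose(-2, -1) / d ** 0.5       # (B, h, Nq, Nk)
    logits = logits + torch.log(vis + 1e-6)[:, None, None, :]   # down-weight occluded tokens
    out = F.softmax(logits, dim=-1) @ vh
    return out.transpose(1, 2).reshape(B, Nq, D)

q = torch.randn(1, 16, 64); kv = torch.randn(1, 100, 64)
vis = torch.rand(1, 100)                                # 1 = visible, 0 = occluded
out = mask_weighted_cross_attention(q, kv, kv, vis)     # (1, 16, 64)
```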
https://arxiv.org/abs/2503.13439
With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing; existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges of acquiring multi-view video data, current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits high-quality 4D reconstruction of such scenes. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods. Project: this https URL
https://arxiv.org/abs/2503.13435
T2 hyperintensities in spinal cord MR images are crucial biomarkers for conditions such as degenerative cervical myelopathy. However, current clinical diagnoses primarily rely on manual evaluation. Deep learning methods have shown promise in lesion detection, but most supervised approaches are heavily dependent on large, annotated datasets. Unsupervised anomaly detection (UAD) offers a compelling alternative by eliminating the need for abnormal data annotations. However, existing UAD methods rely on curated normal datasets and their performance frequently deteriorates when applied to clinical datasets due to domain shifts. We propose an Uncertainty-based Unsupervised Anomaly Detection framework, termed U2AD, to address these limitations. Unlike traditional methods, U2AD is designed to be trained and tested within the same clinical dataset, following a "mask-and-reconstruction" paradigm built on a Vision Transformer-based architecture. We introduce an uncertainty-guided masking strategy to resolve task conflicts between normal reconstruction and anomaly detection to achieve an optimal balance. Specifically, we employ a Monte-Carlo sampling technique to estimate reconstruction uncertainty mappings during training. By iteratively optimizing reconstruction training under the guidance of both epistemic and aleatoric uncertainty, U2AD reduces overall reconstruction variance while emphasizing regions. Experimental results demonstrate that U2AD outperforms existing supervised and unsupervised methods in patient-level identification and segment-level localization tasks. This framework establishes a new benchmark for incorporating uncertainty guidance into UAD, highlighting its clinical utility in addressing domain shifts and task conflicts in medical image anomaly detection. Our code is available: this https URL
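The Monte-Carlo part is straightforward to sketch: keep the model's stochastic layers active, reconstruct several times, and read the per-pixel variance as uncertainty. How U2AD turns this into its epistemic/aleatoric masking schedule is not reproduced here.

```python
import torch

@torch.no_grad()
def mc_reconstruction_uncertainty(model, x, n_samples: int = 8):
    """Per-pixel mean and variance over stochastic reconstructions (MC dropout)."""
    model.train()                      # keep dropout active during sampling
    recons = torch.stack([model(x) for _ in range(n_samples)], dim=0)
    return recons.mean(0), recons.var(0)

# toy reconstruction network with dropout
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout2d(0.3),
    torch.nn.Conv2d(8, 1, 3, padding=1),
)
mean, var = mc_reconstruction_uncertainty(model, torch.rand(1, 1, 64, 64))
mask_prob = var / (var.max() + 1e-8)   # higher uncertainty -> more likely to be masked
```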
https://arxiv.org/abs/2503.13400