Generating high-quality and photorealistic 3D assets remains a longstanding challenge in 3D vision and computer graphics. Although state-of-the-art generative models, such as diffusion models, have made significant progress in 3D generation, they often fall short of human-designed content due to limited ability to follow instructions, align with human preferences, or produce realistic textures, geometries, and physical attributes. In this paper, we introduce Nabla-R2D3, a highly effective and sample-efficient reinforcement learning alignment framework for 3D-native diffusion models using 2D rewards. Built upon the recently proposed Nabla-GFlowNet method, which matches the score function to reward gradients in a principled manner for reward finetuning, our Nabla-R2D3 enables effective adaptation of 3D diffusion models using only 2D reward signals. Extensive experiments show that, unlike vanilla finetuning baselines which either struggle to converge or suffer from reward hacking, Nabla-R2D3 consistently achieves higher rewards and reduced prior forgetting within a few finetuning steps.
https://arxiv.org/abs/2506.15684
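The abstract above does not spell out the Nabla-GFlowNet objective, so the following is only a toy PyTorch sketch of the pathway Nabla-R2D3 relies on: a 3D-native denoiser is finetuned by rendering its (one-step) denoised output to 2D and backpropagating a 2D reward through a differentiable renderer. The denoiser, renderer, and reward below are stand-ins, not the paper's models or its actual objective.

```python
import torch

# Toy stand-ins (not the paper's models or Nabla-GFlowNet objective): the sketch
# only shows the pathway of backpropagating a 2D reward on rendered views into a
# finetunable 3D-native denoiser.
torch.manual_seed(0)
denoiser = torch.nn.Linear(64, 64)                   # finetunable 3D-native denoiser (toy)
render = torch.nn.Linear(64, 3 * 8 * 8)              # differentiable 3D -> 2D projection (toy)
reward_2d = lambda img: -(img - 0.5).pow(2).mean()   # stand-in 2D reward model

opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for step in range(100):
    x_t = torch.randn(16, 64)                        # noisy 3D latents
    x0_hat = denoiser(x_t)                           # one-step denoised estimate
    views = render(x0_hat)                           # render the estimate to 2D views
    loss = -reward_2d(views)                         # maximize the 2D reward
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final (negated) reward loss: {loss.item():.4f}")
```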
Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and this http URL using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at this https URL and our code is available at this https URL.
https://arxiv.org/abs/2506.15682
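As a rough illustration of the search ECAD describes, here is a self-contained toy genetic algorithm over boolean per-(step, block) caching schedules that tracks a quality-latency Pareto front. The `quality` and `latency` functions, step/block counts, and GA settings are hypothetical proxies; the actual method scores candidates by running the diffusion model on a small set of calibration prompts.

```python
import random

# Toy evolutionary search over per-(step, block) cache/reuse masks.
STEPS, BLOCKS = 20, 28          # assumed: 20 denoising steps, 28 transformer blocks

def random_schedule():
    return [[random.random() < 0.3 for _ in range(BLOCKS)] for _ in range(STEPS)]

def latency(s):                 # fraction of block evaluations actually computed
    return 1.0 - sum(map(sum, s)) / (STEPS * BLOCKS)

def quality(s):                 # toy proxy: caching early steps hurts quality more
    penalty = sum((1.0 - t / STEPS) * sum(row) / BLOCKS for t, row in enumerate(s))
    return 1.0 - penalty / STEPS

def mutate(s, p=0.02):
    return [[(not b) if random.random() < p else b for b in row] for row in s]

def crossover(a, b):
    return [ra if random.random() < 0.5 else rb for ra, rb in zip(a, b)]

def pareto_front(pop):
    scored = [(quality(s), latency(s), s) for s in pop]
    return [x for x in scored
            if not any(q >= x[0] and l <= x[1] and (q, l) != (x[0], x[1])
                       for q, l, _ in scored)]

population = [random_schedule() for _ in range(32)]
for _ in range(50):
    parents = [s for _, _, s in pareto_front(population)] or population
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(32)]
print(f"{len(pareto_front(population))} schedules on the final quality-latency front")
```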
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splattings for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at this https URL .
https://arxiv.org/abs/2506.15680
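A minimal NumPy sketch of the particle/grid coupling idea behind the hybrid representation (not the paper's learned model): particle features are scattered into a regular grid to obtain a dense, spatially continuous field, which can then be read back at particle locations. Grid resolution, feature sizes, and the nearest-cell interpolation are assumptions for illustration.

```python
import numpy as np

def particles_to_grid(pos, feat, res=16, lo=-1.0, hi=1.0):
    """Average particle features into the cells of a res^3 grid."""
    idx = np.clip(((pos - lo) / (hi - lo) * res).astype(int), 0, res - 1)
    flat = idx[:, 0] * res * res + idx[:, 1] * res + idx[:, 2]
    grid = np.zeros((res ** 3, feat.shape[1]))
    count = np.zeros(res ** 3)
    np.add.at(grid, flat, feat)      # scatter-add features into cells
    np.add.at(count, flat, 1.0)
    grid[count > 0] /= count[count > 0, None]
    return grid.reshape(res, res, res, -1)

def grid_to_particles(grid, pos, lo=-1.0, hi=1.0):
    """Read the grid field back at particle positions (nearest cell)."""
    res = grid.shape[0]
    idx = np.clip(((pos - lo) / (hi - lo) * res).astype(int), 0, res - 1)
    return grid[idx[:, 0], idx[:, 1], idx[:, 2]]

pos = np.random.uniform(-1, 1, size=(2048, 3))     # object particles
feat = np.random.randn(2048, 8)                    # per-particle features
field = particles_to_grid(pos, feat)               # dense spatial grid
readback = grid_to_particles(field, pos)           # features gathered per particle
print(field.shape, readback.shape)
```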
We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
https://arxiv.org/abs/2506.15673
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
https://arxiv.org/abs/2506.15649
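A schematic of ViMaR's two-stage, margin-penalized selection logic in plain Python. `value_of` and `confidence_of` are hypothetical stand-ins for the temporal-difference value model and the VLM's grounding confidence; splitting captions on commas and the numeric thresholds are purely illustrative.

```python
MARGIN = 0.5

def value_of(segment):                 # stand-in value model
    return min(1.0, len(segment.split()) / 10.0)

def confidence_of(segment):            # stand-in visual-grounding confidence
    return 0.9 if "kite" in segment else 0.4

def margin_score(caption):
    segments = [s.strip() for s in caption.split(",") if s.strip()]
    total = 0.0
    for seg in segments:
        penalty = max(0.0, MARGIN - confidence_of(seg))   # punish low confidence
        total += value_of(seg) - penalty
    return total / max(1, len(segments))

def refine(caption, rewrite_fn):
    # Stage 2: rewrite only weakly grounded segments, keep the rest untouched.
    segments = [s.strip() for s in caption.split(",") if s.strip()]
    return ", ".join(rewrite_fn(s) if confidence_of(s) < MARGIN else s
                     for s in segments)

candidates = ["a red kite above the beach, a dog runs nearby",
              "a kite, something blurry in the distance"]
best = max(candidates, key=margin_score)                  # Stage 1: single pass
print(refine(best, lambda s: s + " [re-grounded]"))
```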
Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low level navigation policies, assessing their performance on these memory-intensive tasks and highlight areas for improvement.
https://arxiv.org/abs/2506.15635
We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it demands strict contact accuracy alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible by our observation that the problem can be separated into two phases: an object-centric phase that primarily makes discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. this https URL.
https://arxiv.org/abs/2506.15625
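A toy PyTorch illustration of the Diffusion Noise Optimization (DNO) mechanism the abstract builds on: the pretrained model stays frozen, and the optimization variable is the sampling noise itself, updated so that a task loss on the generated sample decreases. The tiny "denoiser" and the contact objective below are placeholders, not the paper's models.

```python
import torch

torch.manual_seed(0)
denoiser = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.Tanh(),
                               torch.nn.Linear(64, 32))
for p in denoiser.parameters():
    p.requires_grad_(False)                       # pretrained model stays frozen

target_contact = torch.zeros(32)                  # stand-in contact blueprint
z = torch.randn(1, 32, requires_grad=True)        # the noise is the variable
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    x = denoiser(z)                               # (simplified) generate from noise
    loss = torch.nn.functional.mse_loss(x.squeeze(0), target_contact)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final contact loss: {loss.item():.4f}")
```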
Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
https://arxiv.org/abs/2506.15610
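The association step described above can be illustrated with a minimal axis-aligned 3D IoU and NMS in NumPy; the paper's full module additionally performs cross-view box correspondence matching and IoU-guided particle-filter optimization, which are not reproduced here. Boxes are assumed to be `[x1, y1, z1, x2, y2, z2]`.

```python
import numpy as np

def iou_3d(a, b):
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

def nms_3d(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[[iou_3d(boxes[i], boxes[j]) < thresh for j in rest]]
    return keep

boxes = np.array([[0, 0, 0, 1, 1, 1],
                  [0.1, 0.1, 0.1, 1.1, 1.1, 1.1],
                  [2, 2, 2, 3, 3, 3]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms_3d(boxes, scores))   # -> [0, 2]: the near-duplicate box 1 is suppressed
```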
In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at this https URL.
https://arxiv.org/abs/2506.15596
Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.
https://arxiv.org/abs/2506.15577
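The "pathing and abstracting" graph operations are not specified in detail in the abstract; the sketch below shows a generic graph-based skeletonization on a stand-in point cloud under assumed parameters: a k-NN graph, geodesic distances from a root via Dijkstra, and skeleton nodes obtained by binning points along path length (scikit-learn and SciPy are assumed to be available).

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from sklearn.neighbors import kneighbors_graph

# Generic "pathing and abstracting" sketch on a fake tree-like point cloud.
pts = np.random.rand(3000, 3) * np.array([1.0, 1.0, 10.0])   # stand-in point cloud
graph = kneighbors_graph(pts, n_neighbors=8, mode="distance")
root = int(np.argmin(pts[:, 2]))                             # lowest point as root
geo = dijkstra(graph, directed=False, indices=root)          # geodesic path lengths

finite = np.isfinite(geo)
geo[~finite] = geo[finite].max()                             # park disconnected points
n_bins = 30
bins = np.minimum((geo / geo.max() * n_bins).astype(int), n_bins - 1)
skeleton = np.array([pts[bins == b].mean(axis=0)
                     for b in range(n_bins) if np.any(bins == b)])
print(skeleton.shape)                                        # ordered skeleton nodes
```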
Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.
https://arxiv.org/abs/2506.15565
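The FreqWeaver design itself is not described in the abstract; as a generic illustration of the parameter-efficient idea only (a small trainable module added to a frozen backbone, initialized so it starts as the identity), here is a standard bottleneck adapter in PyTorch with assumed dimensions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=48):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)      # start as identity so the frozen
        nn.init.zeros_(self.up.bias)        # backbone's behavior is preserved

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

tokens = torch.randn(2, 196, 768)           # frozen-backbone token features
adapter = Adapter()
trainable = sum(p.numel() for p in adapter.parameters())
print(adapter(tokens).shape, f"trainable params: {trainable}")
```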
This paper presents improved native unified multimodal models, \emph{i.e.,} Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at this https URL.
https://arxiv.org/abs/2506.15564
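A minimal sketch of the two native objectives named above: next-token cross-entropy for the language head and a flow-matching loss (predicting the velocity along a linear interpolation between noise and data) for the flow head. The tiny heads and random features are placeholders, not Show-o2's architecture.

```python
import torch

vocab, dim = 100, 32
lang_head = torch.nn.Linear(dim, vocab)
flow_head = torch.nn.Sequential(torch.nn.Linear(dim + 1, 64), torch.nn.SiLU(),
                                torch.nn.Linear(64, dim))

h_text = torch.randn(8, dim)                      # hidden states at text positions
targets = torch.randint(0, vocab, (8,))
loss_text = torch.nn.functional.cross_entropy(lang_head(h_text), targets)

x1 = torch.randn(8, dim)                          # clean visual latents
x0 = torch.randn(8, dim)                          # noise
t = torch.rand(8, 1)
xt = (1 - t) * x0 + t * x1                        # linear interpolation path
v_target = x1 - x0                                # flow-matching velocity target
v_pred = flow_head(torch.cat([xt, t], dim=-1))
loss_flow = torch.nn.functional.mse_loss(v_pred, v_target)

loss = loss_text + loss_flow
print(loss_text.item(), loss_flow.item())
```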
Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
https://arxiv.org/abs/2506.15563
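The Langevin-dynamics-based update can be contrasted with plain gradient descent using the textbook step x ← x − ε∇E(x) + √(2ε)·ξ; the quadratic energy below is only a stand-in for the layout-conditioned attention energy, and the adaptive scheme itself is not reproduced.

```python
import torch

def energy(x):
    return 0.5 * (x ** 2).sum()       # stand-in energy function

x = torch.randn(2, requires_grad=True)
eps = 0.01
for _ in range(500):
    grad, = torch.autograd.grad(energy(x), x)
    with torch.no_grad():
        # Langevin step: gradient descent plus properly scaled Gaussian noise.
        x += -eps * grad + (2 * eps) ** 0.5 * torch.randn_like(x)
print(x.detach())
```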
Cancer is an abnormal growth with potential to invade locally and metastasize to distant organs. Accurate auto-segmentation of the tumor and surrounding normal tissues is required for radiotherapy treatment plan optimization. Recent AI-based segmentation models are generally trained on large public datasets, which lack the heterogeneity of local patient populations. While these studies advance AI-based medical image segmentation, research on local datasets is necessary to develop and integrate AI tumor segmentation models directly into hospital software for efficient and accurate oncology treatment planning and execution. This study enhances tumor segmentation using computationally efficient hybrid UNet-Transformer models on magnetic resonance imaging (MRI) datasets acquired from a local hospital under strict privacy protection. We developed a robust data pipeline for seamless DICOM extraction and preprocessing, followed by extensive image augmentation to ensure model generalization across diverse clinical settings, resulting in a total dataset of 6080 images for training. Our novel architecture integrates UNet-based convolutional neural networks with a transformer bottleneck and complementary attention modules, including efficient attention, Squeeze-and-Excitation (SE) blocks, Convolutional Block Attention Module (CBAM), and ResNeXt blocks. To accelerate convergence and reduce computational demands, we used a maximum batch size of 8 and initialized the encoder with pretrained ImageNet weights, training the model on dual NVIDIA T4 GPUs via checkpointing to overcome Kaggle's runtime limits. Quantitative evaluation on the local MRI dataset yielded a Dice similarity coefficient of 0.764 and an Intersection over Union (IoU) of 0.736, demonstrating competitive performance despite limited data and underscoring the importance of site-specific model development for clinical deployment.
https://arxiv.org/abs/2506.15562
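Of the attention modules listed above, the Squeeze-and-Excitation block is a standard, self-contained component; a reference PyTorch version is shown below (the surrounding hybrid UNet-Transformer and the other modules are not reproduced).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excite: channel-wise reweighting

feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)                  # torch.Size([2, 64, 32, 32])
```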
Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
https://arxiv.org/abs/2506.15560
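The defining property of the supervision described above is its sparsity (roughly 1% of pixels). A common way to express such a signal, shown here only as an assumed illustration rather than RaCalNet's actual loss, is a masked L1 over the pixels where LiDAR returns exist.

```python
import torch

pred_depth = torch.rand(1, 1, 240, 320) * 80.0        # network output (meters)
lidar = torch.zeros(1, 1, 240, 320)                   # mostly-empty LiDAR map
ys = torch.randint(0, 240, (700,))
xs = torch.randint(0, 320, (700,))
lidar[0, 0, ys, xs] = torch.rand(700) * 80.0          # ~1% of pixels carry depth

valid = lidar > 0                                     # supervise only valid returns
loss = (pred_depth[valid] - lidar[valid]).abs().mean()
print(f"supervised pixels: {valid.sum().item()}, masked L1: {loss.item():.3f}")
```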
This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
https://arxiv.org/abs/2506.15524
Cervical cancer remains a significant health problem, especially in developing countries. Early detection is critical for effective treatment. Convolutional neural networks (CNN) have shown promise in automated cervical cancer screening, but their performance depends on Pap smear image quality. This study investigates the impact of various image preprocessing techniques on CNN performance for cervical cancer classification using the SIPaKMeD dataset. Three preprocessing techniques were evaluated: the Perona-Malik diffusion (PMD) filter for noise reduction, contrast-limited adaptive histogram equalization (CLAHE) for image contrast enhancement, and the proposed hybrid PMD filter-CLAHE approach. The enhanced image datasets were evaluated on pretrained models, such as ResNet-34, ResNet-50, SqueezeNet-1.0, MobileNet-V2, EfficientNet-B0, EfficientNet-B1, DenseNet-121, and DenseNet-201. The results show that the hybrid PMD filter-CLAHE preprocessing can improve Pap smear image quality and CNN architecture performance compared to the original images. The maximum metric improvements are 13.62% for accuracy, 10.04% for precision, 13.08% for recall, and 14.34% for F1-score. The proposed hybrid PMD filter-CLAHE technique offers a new perspective in improving cervical cancer classification performance using CNN architectures.
https://arxiv.org/abs/2506.15489
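Both preprocessing stages are standard and easy to sketch: Perona-Malik anisotropic diffusion for denoising, followed by OpenCV's CLAHE for contrast enhancement. The iteration count, kappa, gamma, and CLAHE settings below are illustrative defaults, not the paper's tuned values, and the input image is a random stand-in for a Pap smear.

```python
import cv2
import numpy as np

def perona_malik(img, n_iter=15, kappa=30.0, gamma=0.2):
    u = img.astype(np.float32)
    for _ in range(n_iter):
        n = np.roll(u, -1, axis=0) - u      # north/south/east/west differences
        s = np.roll(u, 1, axis=0) - u
        e = np.roll(u, -1, axis=1) - u
        w = np.roll(u, 1, axis=1) - u
        cN, cS = np.exp(-(n / kappa) ** 2), np.exp(-(s / kappa) ** 2)
        cE, cW = np.exp(-(e / kappa) ** 2), np.exp(-(w / kappa) ** 2)
        u = u + gamma * (cN * n + cS * s + cE * e + cW * w)
    return np.clip(u, 0, 255).astype(np.uint8)

gray = np.random.randint(0, 256, (256, 256), dtype=np.uint8)   # stand-in image
smoothed = perona_malik(gray)                                  # PMD denoising
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(smoothed)                               # hybrid PMD + CLAHE
print(enhanced.shape, enhanced.dtype)
```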
Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
https://arxiv.org/abs/2506.15477
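A schematic of instance-specific prompt customization via a conditional affine transform: scale and shift parameters are predicted from pooled visual features and modulate a bank of learnable prompt embeddings. Shapes, module names, and the FiLM-style formulation are assumptions for illustration, not MRG-LLM's actual design.

```python
import torch
import torch.nn as nn

class DynamicPrompt(nn.Module):
    def __init__(self, n_prompts=8, dim=512, vis_dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.to_affine = nn.Linear(vis_dim, 2 * dim)    # predicts (gamma, beta)

    def forward(self, vis_feat):                        # vis_feat: (B, vis_dim)
        gamma, beta = self.to_affine(vis_feat).chunk(2, dim=-1)
        # broadcast over the prompt tokens: (B, 1, dim) modulates (1, P, dim)
        return (1 + gamma.unsqueeze(1)) * self.prompts.unsqueeze(0) + beta.unsqueeze(1)

vis = torch.randn(4, 768)                               # pooled image features
print(DynamicPrompt()(vis).shape)                       # torch.Size([4, 8, 512])
```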
Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model's reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample's deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
https://arxiv.org/abs/2506.15404
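A toy NumPy version of the scoring idea: form per-class centroids of neuron-level relevance vectors for in-distribution data, and score a new sample by its distance to the nearest centroid (the scaled bias-term relevance and feature-norm combination mentioned above are omitted). The relevance vectors below are random stand-ins rather than real attributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_neurons = 5, 256
relevance = {c: rng.normal(loc=c, size=(200, n_neurons)) for c in range(n_classes)}
centroids = np.stack([relevance[c].mean(axis=0) for c in range(n_classes)])

def ood_score(sample_relevance):
    dists = np.linalg.norm(centroids - sample_relevance, axis=1)
    return dists.min()          # far from every ID centroid -> likely OOD

id_sample = rng.normal(loc=2, size=n_neurons)          # resembles class 2
ood_sample = rng.normal(loc=40, size=n_neurons)        # far from every ID class
print(ood_score(id_sample), ood_score(ood_sample))
```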
Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process. Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.
https://arxiv.org/abs/2506.15381