Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong denoising capabilities and can generate data adapted to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR, which restores images degraded by weather effects, and Point Cloud Restoration (PCR), which compensates for corrupted LiDAR data using image object cues. To tackle misalignment between the two modalities, we develop the Bidirectional Adaptive Fusion and Alignment Module (BAFAM), which enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of DiffFusion will be released as open source.
https://arxiv.org/abs/2512.13107
The spinal angle is an important indicator of body balance, so restoring the 3D shape of the human body and estimating the spine center line is important. Existing multi-image-based body restoration methods require expensive equipment and complex procedures, while single-image-based methods struggle to accurately estimate internal structures such as the spine center line due to occlusion and viewpoint limitations. This study proposes a method that compensates for the shortcomings of the multi-image-based approach while addressing the limitations of the single-image approach. We propose a 3D body posture analysis system that integrates depth images from four directions to restore a 3D human model and automatically estimate the spine center line. Hierarchical matching with global and fine registration makes the restoration robust to noise and occlusion. In addition, Adaptive Vertex Reduction is applied to maintain the resolution and shape reliability of the mesh, and a Level-of-Detail ensemble secures both the accuracy and stability of spinal angle estimation. The proposed method achieves high-precision 3D spine registration without relying on training data or complex neural network models, and verification confirms the improvement in matching quality.
https://arxiv.org/abs/2512.12718
Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: this https URL.
https://arxiv.org/abs/2512.12060
Abutment design is a critical step in dental implant restoration. However, manual design involves tedious measurement and fitting, and research on automating this process with AI is limited, due to the unavailability of large annotated datasets. Although self-supervised learning (SSL) can alleviate data scarcity, its need for pre-training and fine-tuning results in high computational costs and long training times. In this paper, we propose a Self-supervised assisted automatic abutment design framework (SS$A^3$D), which employs a dual-branch architecture with a reconstruction branch and a regression branch. The reconstruction branch learns to restore masked intraoral scan data and transfers the learned structural information to the regression branch. The regression branch then predicts the abutment parameters under supervised learning, which eliminates the separate pre-training and fine-tuning process. We also design a Text-Conditioned Prompt (TCP) module to incorporate clinical information (such as implant location, system, and series) into SS$A^3$D. This guides the network to focus on relevant regions and constrains the parameter predictions. Extensive experiments on a collected dataset show that SS$A^3$D saves half of the training time and achieves higher accuracy than traditional SSL methods. It also achieves state-of-the-art performance compared to other methods, significantly improving the accuracy and efficiency of automated abutment design.
https://arxiv.org/abs/2512.11507
Lens flare is a degradation phenomenon caused by strong light sources. Existing research on flare removal has mainly focused on images, while the spatiotemporal characteristics of video flare remain largely unexplored. Video flare synthesis and removal pose significantly greater challenges than their image counterparts, owing to the complex and mutually independent motion of flare, light sources, and scene content. This motion independence further degrades restoration performance, often resulting in flicker and artifacts. To address this issue, we propose a physics-informed dynamic flare synthesis pipeline, which simulates light source motion using optical flow and models the temporal behaviors of both scattering and reflective flares. Meanwhile, we design a video flare removal network that employs an attention module to spatially suppress flare regions and incorporates a Mamba-based temporal modeling component to capture long-range spatio-temporal dependencies. This motion-independent spatiotemporal representation effectively eliminates the need for multi-frame alignment, alleviating temporal aliasing between flares and scene content and thereby improving video flare removal performance. Building upon this, we construct the first video flare dataset to comprehensively evaluate our method, which includes a large set of synthetic paired videos and additional real-world videos collected from the Internet to assess generalization capability. Extensive experiments demonstrate that our method consistently outperforms existing video-based restoration and image-based flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and maintaining the spatiotemporal consistency of the scene.
https://arxiv.org/abs/2512.11327
Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.
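As a rough illustration of LEGO's three-stage recipe, here is a minimal Python sketch with stub models standing in for the real networks; `restore`, `oracle_refine`, and the scalar "model" gain are hypothetical placeholders, not the paper's actual components:

```python
# Minimal sketch of the three-stage LEGO adaptation loop with stubs.

def restore(model, x):
    # Stage 1: initial restoration from the (frozen) pre-trained restorer.
    return [model * v for v in x]  # stub: scale inputs by a "model" gain

def oracle_refine(y_hat):
    # Stage 2: a frozen generative oracle refines the rough estimate into
    # a pseudo-ground-truth (stub: clamp values into a valid range).
    return [min(max(v, 0.0), 1.0) for v in y_hat]

def build_training_set(in_dist_pairs, ood_inputs, model):
    # Stage 3 input: mix in-distribution pairs with the new pseudo-pairs,
    # so fine-tuning sees both distributions (mixed supervision).
    pseudo_pairs = [(x, oracle_refine(restore(model, x))) for x in ood_inputs]
    return in_dist_pairs + pseudo_pairs
```

The point of the mixed batch is that the model adapts to the out-of-distribution inputs without forgetting its original in-distribution behavior.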
https://arxiv.org/abs/2512.11121
All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR achieves competitive performance across a range of scenarios.
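The routing idea behind PCGRM can be sketched generically: soft-assign a degradation feature to cluster centroids, then blend expert outputs by the assignment probabilities. The centroids and experts below are invented stand-ins, not ClusIR's learned components:

```python
import math

# Toy probabilistic cluster-guided routing: negative squared distance to
# each centroid acts as a logit; softmax probabilities weight the experts.

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def route(feature, centroids, experts):
    logits = [-sum((f - c) ** 2 for f, c in zip(feature, cen))
              for cen in centroids]
    probs = softmax(logits)
    outs = [e(feature) for e in experts]
    # Blend expert outputs element-wise by cluster probability.
    return [sum(p * o[i] for p, o in zip(probs, outs))
            for i in range(len(outs[0]))]
```

A feature near one centroid routes almost entirely to that cluster's expert, which is the "stable expert routing" behavior the abstract describes.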
https://arxiv.org/abs/2512.10948
All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at this https URL.
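The symmetric-scale claim can be illustrated with a toy 1-D encoder-decoder: because each decoder stage mirrors an encoder scale exactly, the skip connection can be fused by plain addition. This is a shapes-only sketch with no learned weights, not SymUNet itself (input length must be divisible by 2**depth):

```python
# Toy symmetric encoder-decoder with additive skip fusion.

def down(x):                      # 2x average pooling
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]

def up(x):                        # 2x nearest-neighbour upsampling
    return [v for v in x for _ in range(2)]

def sym_unet(x, depth=2):
    skips, h = [], x
    for _ in range(depth):        # encoder: store a skip at every scale
        skips.append(h)
        h = down(h)
    for s in reversed(skips):     # decoder: mirror the scales, add the skip
        h = [a + b for a, b in zip(up(h), s)]
    return h
```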
https://arxiv.org/abs/2512.10581
A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
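As a reading aid (our unpacking of the abstract, not the paper's derivation), the KL objective behind VML for a linear measurement model $\mathbf{y} = \mathbf{A}\mathbf{x}_0 + \mathbf{n}$ with $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ can be written as:

```latex
\mathcal{L}_{\mathrm{VML}}(\mathbf{x}_t)
  = D_{\mathrm{KL}}\!\left( p(\mathbf{x}_0 \mid \mathbf{x}_t)
      \,\middle\|\, p(\mathbf{x}_0 \mid \mathbf{y}) \right)
  = \mathbb{E}_{p(\mathbf{x}_0 \mid \mathbf{x}_t)}
      \left[ \log p(\mathbf{x}_0 \mid \mathbf{x}_t)
           - \log p(\mathbf{x}_0 \mid \mathbf{y}) \right].
```

By Bayes' rule, $\log p(\mathbf{x}_0 \mid \mathbf{y}) = \log p(\mathbf{y} \mid \mathbf{x}_0) + \log p(\mathbf{x}_0) - \log p(\mathbf{y})$, and for the Gaussian likelihood the measurement term reduces to $-\tfrac{1}{2\sigma^2}\lVert \mathbf{y} - \mathbf{A}\mathbf{x}_0 \rVert^2$ up to constants, which is consistent with the abstract's claim that the linear case admits an analytic form.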
https://arxiv.org/abs/2512.10524
The integrity and reliability of scientific literature face a serious threat from adversarial text generation techniques, specifically the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases": statistically improbable synonyms (e.g., "counterfeit consciousness" for "artificial intelligence") that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates on novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but also to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
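Token-level pseudo-perplexity is a standard masked-LM score: mask each token in turn, take the model's probability for it, and exponentiate the mean negative log-probability. The sketch below uses a toy lookup table in place of SciBERT (`toy_mlm_prob` is a hypothetical stand-in), just to show why tortured phrases score high:

```python
import math

# Pseudo-perplexity sketch for tortured-phrase detection.

def toy_mlm_prob(context, token):
    # Stand-in for p(token | masked context) from a masked language model.
    common = {"artificial": 0.30, "intelligence": 0.40, "networks": 0.20}
    return common.get(token, 0.001)   # improbable fillers get low probability

def pseudo_perplexity(tokens):
    # exp of the average negative log-probability over masked positions.
    nll = [-math.log(toy_mlm_prob(tokens[:i] + tokens[i + 1:], tok))
           for i, tok in enumerate(tokens)]
    return math.exp(sum(nll) / len(nll))
```

A phrase like ["counterfeit", "consciousness"] scores orders of magnitude higher than ["artificial", "intelligence"], which is the statistical anomaly the first stage thresholds on.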
https://arxiv.org/abs/2512.10435
Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose \textbf{HD-CLIP} (\textbf{H}ierarchical \textbf{D}egradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate HD-CLIP into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a \textbf{plug-and-play module}, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
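The ordinal-embedding idea can be illustrated in a few lines: known severity levels map to ordered anchor vectors, and an unseen severity is a linear interpolation of its two neighbouring anchors. The anchor levels and 2-D vectors below are invented for illustration, not HD-CLIP's actual representation:

```python
# Toy ordinal degradation embedding with interpolation across unseen levels.

def ordinal_embed(severity, anchors):
    """anchors: dict {severity_level: embedding vector (list of floats)}."""
    levels = sorted(anchors)
    if severity <= levels[0]:          # clamp below the smallest anchor
        return list(anchors[levels[0]])
    if severity >= levels[-1]:         # clamp above the largest anchor
        return list(anchors[levels[-1]])
    for lo, hi in zip(levels, levels[1:]):
        if lo <= severity <= hi:       # interpolate between neighbours
            w = (severity - lo) / (hi - lo)
            return [(1 - w) * a + w * b
                    for a, b in zip(anchors[lo], anchors[hi])]
```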
https://arxiv.org/abs/2512.10340
Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.
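A minimal sketch of data-equilibrium scheduling, under the illustrative assumption that mixture weights are re-normalized toward tasks with higher validation loss (the paper's actual mixing law is not specified in the abstract):

```python
import random

# Toy mixture scheduler: per-task sampling weights, updated and renormalized.

def normalize(w):
    s = sum(w.values())
    return {k: v / s for k, v in w.items()}

def update_mixture(weights, val_loss):
    # Shift probability mass toward tasks that currently lag behind
    # (higher validation loss -> larger share of the next batches).
    return normalize({t: weights[t] * val_loss[t] for t in weights})

def sample_task(weights, rng):
    tasks, probs = zip(*weights.items())
    return rng.choices(tasks, weights=probs, k=1)[0]
```

Each training batch is then drawn from the dataset of `sample_task(...)`, so the effective dataset composition tracks the evolving weights.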
https://arxiv.org/abs/2512.09282
Event cameras have the potential to revolutionize vision systems with their high temporal resolution and dynamic range, yet they remain susceptible to lens flare, a fundamental optical artifact that causes severe degradation. In event streams, this optical artifact forms a complex, spatio-temporal distortion that has been largely overlooked. We present E-Deflare, the first systematic framework for removing lens flare from event camera data. We first establish the theoretical foundation by deriving a physics-grounded forward model of the non-linear suppression mechanism. This insight enables the creation of the E-Deflare Benchmark, a comprehensive resource featuring a large-scale simulated training set, E-Flare-2.7K, and the first-ever paired real-world test set, E-Flare-R, captured by our novel optical system. Empowered by this benchmark, we design E-DeflareNet, which achieves state-of-the-art restoration performance. Extensive experiments validate our approach and demonstrate clear benefits for downstream tasks. Code and datasets are publicly available.
https://arxiv.org/abs/2512.09016
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
https://arxiv.org/abs/2512.08922
Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with spatial alignment and semantic enhancement conditioning strategy, which map masks and sketches into spatial constraints, and encode contextual embedding from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil paintings generation and editing.
https://arxiv.org/abs/2512.08534
Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image's direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result. Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.
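The two frequency-domain priors are concrete enough to sketch numerically. The helper names below are ours, not the paper's implementation; the thresholds follow the text (DC balance across the three channels, and the ~1% share of radial frequencies below 0.001):

```python
import numpy as np

# Numerical sketch of SFP's two frequency-domain priors.

def dc_prior_gap(img):
    # Prior 1: per-channel relative gap between each channel's DC magnitude
    # and the mean DC magnitude across the three channels (small when clear).
    dc = np.array([np.abs(np.fft.fft2(img[..., c]))[0, 0] for c in range(3)])
    return np.abs(dc - dc.mean()) / dc.mean()

def low_freq_fraction(channel, radius=0.001):
    # Prior 2: fraction of total spectral magnitude carried by radial
    # frequencies below `radius` (~1% for clear images, per the text).
    h, w = channel.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mag = np.abs(np.fft.fft2(channel))
    low = mag[np.sqrt(fx ** 2 + fy ** 2) < radius].sum()
    return low / mag.sum()
```

Deviations from these two statistics would then drive the parameters of the adaptive frequency-enhancement mask described above.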
https://arxiv.org/abs/2512.08254
Image dehazing is crucial for reliable visual perception, yet it remains highly challenging under real-world non-uniform haze conditions. Although Transformer-based methods excel at capturing global context, their quadratic computational complexity hinders real-time deployment. To address this, we propose Fourier Receptance Weighted Key Value (Fourier-RWKV), a novel dehazing framework based on a Multi-State Perception paradigm. The model achieves comprehensive haze degradation modeling with linear complexity by synergistically integrating three distinct perceptual states: (1) Spatial-form Perception, realized through the Deformable Quad-directional Token Shift (DQ-Shift) operation, which dynamically adjusts receptive fields to accommodate local haze variations; (2) Frequency-domain Perception, implemented within the Fourier Mix block, which extends the core WKV attention mechanism of RWKV from the spatial domain to the Fourier domain, preserving the long-range dependencies essential for global haze estimation while mitigating spatial attenuation; (3) Semantic-relation Perception, facilitated by the Semantic Bridge Module (SBM), which utilizes Dynamic Semantic Kernel Fusion (DSK-Fusion) to precisely align encoder-decoder features and suppress artifacts. Extensive experiments on multiple benchmarks demonstrate that Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios while significantly reducing computational overhead, establishing a favorable trade-off between restoration quality and practical efficiency. Code is available at: this https URL.
https://arxiv.org/abs/2512.08161
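The token-shift idea behind DQ-Shift can be illustrated without the deformable part: split the channels into four groups and shift each group one pixel in a different direction, so every position mixes information from its four neighbors at zero parameter cost. A hedged numpy sketch (the paper's deformable variant would predict per-pixel offsets instead of this fixed one-pixel shift):

```python
import numpy as np

def quad_directional_shift(x):
    """Shift four equal channel groups of an (H, W, C) feature map by one
    pixel up, down, left, and right, with zero padding at the borders."""
    h, w, c = x.shape
    assert c % 4 == 0, "channels must split into four groups"
    g = c // 4
    out = np.zeros_like(x)
    out[:-1, :, 0*g:1*g] = x[1:, :, 0*g:1*g]    # up
    out[1:, :, 1*g:2*g] = x[:-1, :, 1*g:2*g]    # down
    out[:, :-1, 2*g:3*g] = x[:, 1:, 2*g:3*g]    # left
    out[:, 1:, 3*g:4*g] = x[:, :-1, 3*g:4*g]    # right
    return out
```

In RWKV-style vision models such a shift precedes the WKV mixing step; here it only demonstrates the receptive-field expansion that the deformable version makes content-adaptive.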
Flow-based text-to-image (T2I) models excel at prompt-driven image generation, but falter on Image Restoration (IR), often "drifting away" from being faithful to the measurement. Prior work mitigates this drift with data-specific flows or task-specific adapters that are computationally heavy and do not scale across tasks. This raises the question: "Can't we efficiently manipulate the existing generative capabilities of a flow model?" To this end, we introduce FlowSteer (FS), an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow's implicit guidance with explicit measurement constraints. Across super-resolution, deblurring, denoising, and colorization, FS improves measurement consistency and identity preservation in a strictly zero-shot setting: no retrained models, no adapters. We show how the nature of flow models and their sensitivities to noise inform the design of such a scheduler. FlowSteer, although simple, achieves higher fidelity in reconstructed images while leveraging the rich generative priors of flow models.
https://arxiv.org/abs/2512.08125
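The core idea of injecting measurement priors along the sampling path can be shown on a toy linear problem: alternate a frozen-flow update with a gradient step on the measurement residual. This scheme is our illustration of the spirit of FlowSteer, not the paper's exact scheduler:

```python
import numpy as np

def guided_flow_step(x, v, y, A, dt, lam):
    """One Euler step of a flow ODE with a measurement-consistency correction.

    x : current sample, v : velocity from a (frozen) flow model,
    y = A @ x_true : linear measurement (e.g. downsampling or blur).
    After the flow update, a gradient step on ||A x - y||^2 nudges the
    trajectory toward samples consistent with the measurement.
    """
    x = x + dt * v                    # frozen-flow update
    x = x - lam * A.T @ (A @ x - y)   # measurement-prior injection
    return x
```

With the flow term removed this reduces to plain gradient descent on the data-fidelity term; the interesting regime is when both terms steer the sample, which the paper tunes via noise-sensitivity-aware scheduling.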
City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.
https://arxiv.org/abs/2512.07527
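The Z-monotonic SDF constraint is worth unpacking: if the signed field is strictly increasing in z, every vertical ray crosses the zero level set exactly once, which is what guarantees a watertight mesh with vertically extruded facades. The canonical closed form for a 2.5D height map is f(x, y, z) = z - h(x, y); the paper optimizes a learned SDF under this monotonicity constraint, so the sketch below is illustrative only:

```python
import numpy as np

def z_monotonic_sdf(height_map, z):
    """Evaluate a Z-monotonic signed field for a 2.5D city model.

    f(x, y, z) = z - h(x, y) is positive above the roof height and negative
    below it, and strictly increasing in z, so each vertical column has a
    single zero crossing at the roof -- the watertightness property the
    paper exploits for urban geometry.
    """
    return z - height_map
```

A marching-cubes pass over this field at level zero would recover flat roofs at the stored heights joined by vertical walls at height discontinuities.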
Underwater image restoration is essential for marine applications ranging from ecological monitoring to archaeological surveys, but effectively addressing the complex and spatially varying nature of underwater degradations remains a challenge. Existing methods typically apply uniform restoration strategies across the entire image, struggling to handle multiple co-occurring degradations that vary spatially and with water conditions. We introduce TIDE, a $\underline{t}$wo-stage $\underline{i}$nverse $\underline{d}$egradation $\underline{e}$stimation framework that explicitly models degradation characteristics and applies targeted restoration through specialized prior decomposition. Our approach disentangles the restoration process into multiple specialized hypotheses that are adaptively fused based on local degradation patterns, followed by a progressive refinement stage that corrects residual artifacts. Specifically, TIDE decomposes underwater degradations into four key factors, namely color distortion, haze, detail loss, and noise, and designs restoration experts specialized for each. By generating specialized restoration hypotheses, TIDE balances competing degradation factors and produces natural results even in highly degraded regions. Extensive experiments across both standard benchmarks and challenging turbid water conditions show that TIDE achieves competitive performance on reference-based fidelity metrics while outperforming state-of-the-art methods on non-reference perceptual quality metrics, with strong improvements in color correction and contrast enhancement. Our code is available at: this https URL.
https://arxiv.org/abs/2512.07171
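The adaptive fusion of expert hypotheses described above amounts to a spatially varying convex combination: each pixel weights the K expert outputs by local evidence for each degradation factor. A minimal numpy sketch, assuming a per-pixel softmax over degradation logits (in TIDE the weight prediction is learned; the names here are ours):

```python
import numpy as np

def fuse_hypotheses(hypotheses, degradation_logits):
    """Fuse per-expert restoration hypotheses with spatially varying weights.

    hypotheses         : (K, H, W, C) outputs of K specialized experts
                         (e.g. color, dehaze, detail, denoise)
    degradation_logits : (K, H, W) local evidence for each degradation factor
    A per-pixel softmax over the K factors yields convex weights, so the
    fused pixel is a weighted blend of the expert outputs at that location.
    """
    shifted = degradation_logits - degradation_logits.max(axis=0, keepdims=True)
    w = np.exp(shifted)
    w = w / w.sum(axis=0, keepdims=True)            # (K, H, W), sums to 1
    return (w[..., None] * hypotheses).sum(axis=0)  # (H, W, C)
```

Where one degradation dominates locally (e.g. heavy haze in one region), its expert's weight approaches 1 there while other regions still draw on the remaining experts, which is how the framework balances competing degradation factors.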