Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of the data used by existing restoration methods), pretrained video models can be induced to perform competitive image restoration, handling multiple tasks with a single model and rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision and opening a new design paradigm for foundation models in visual tasks.
https://arxiv.org/abs/2603.13089
In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. Concretely, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that applying our strategy to different UIE baselines significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.
https://arxiv.org/abs/2603.12773
This paper presents a new ambient light normalization framework, DINOLight, that integrates the self-supervised model DINOv2's image understanding capability into the restoration process as a visual prior. Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometries. We observe that DINOv2 can reliably extract both semantic and geometric information from a degraded image. Based on this observation, we develop a novel framework to utilize DINOv2 features for lighting normalization. First, we propose an adaptive feature fusion module that combines features from different DINOv2 layers using a point-wise softmax mask. Next, the fused features are integrated into our proposed restoration network in both spatial and frequency domains through an auxiliary cross-attention mechanism. Experiments show that DINOLight achieves superior performance on the Ambient6K dataset, and that DINOv2 features are effective for enhancing ambient light normalization. We also apply our method to shadow-removal benchmark datasets, achieving competitive results compared to methods that use mask priors. Codes will be released upon acceptance.
https://arxiv.org/abs/2603.12579
Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present \textbf{CalliMaster}, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of ``planning before writing'', we introduce a coarse-to-fine pipeline \textbf{(Text $\rightarrow$ Layout $\rightarrow$ Image)} to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework's extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.
https://arxiv.org/abs/2603.12482
Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at this https URL.
https://arxiv.org/abs/2603.12083
While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.
https://arxiv.org/abs/2603.11866
Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.
https://arxiv.org/abs/2603.11680
General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models like SEMamba have advanced the state-of-the-art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose Frequency GLP, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. Then, we design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.
https://arxiv.org/abs/2603.11669
Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor-sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly as masking progresses. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
https://arxiv.org/abs/2603.11664
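BackdoorIDS's decision rule — flag an input when its masking-trajectory embeddings split into more than one cluster — can be sketched in a few lines. The following is a minimal numpy illustration; `count_clusters` is a hypothetical single-linkage stand-in for DBSCAN, and the toy embedding sequences are fabricated for demonstration, not the authors' code or data.

```python
import numpy as np

def count_clusters(embeddings, eps=0.5):
    """Single-linkage grouping: two embeddings join the same cluster when
    their Euclidean distance is below eps (a crude stand-in for DBSCAN)."""
    n = len(embeddings)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < eps:
                pi, pj = find(i), find(j)
                if pi != pj:
                    parent[pi] = pj
    return len({find(i) for i in range(n)})

def is_backdoored(embedding_sequence, eps=0.5):
    # Flagged when the masking-trajectory embeddings form >1 cluster.
    return count_clusters(embedding_sequence, eps) > 1

# Clean image: embeddings drift smoothly along the masking trajectory.
clean = np.cumsum(np.full((10, 4), 0.05), axis=0)
# Backdoored image: abrupt embedding jump once the trigger is masked out.
trig = clean.copy()
trig[5:] += 3.0
```

With these toy trajectories, `is_backdoored(clean)` is `False` (one smooth cluster) while `is_backdoored(trig)` is `True` (the jump splits the sequence in two), mirroring the paper's Attention Hijacking-and-Restoration signal.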
The simplicity and effectiveness of the UNet architecture make it ubiquitous in image restoration, image segmentation, and diffusion models. UNets are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at this https URL.
https://arxiv.org/abs/2603.11323
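The aliasing failure this paper targets is visible even in one dimension: stride-2 sampling of a Nyquist-frequency signal is wildly shift-dependent, while low-pass filtering before downsampling (the classic anti-aliasing idea that alias-free layers build on) removes the discrepancy. A small numpy demonstration, not the paper's actual layers:

```python
import numpy as np

def downsample(x):
    # naive stride-2 sampling, as in plain strided conv/pooling
    return x[::2]

def blur(x):
    # [1, 2, 1]/4 low-pass filter with circular padding
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

x = np.tile([1.0, -1.0], 8)   # pure Nyquist-frequency signal
shifted = np.roll(x, 1)       # one-pixel translation of the input

# shift-equivariance error of plain vs. anti-aliased downsampling
naive_err = np.abs(downsample(x) - downsample(shifted)).max()
af_err = np.abs(downsample(blur(x)) - downsample(blur(shifted))).max()
```

Here `naive_err` is 2.0 (the downsampled outputs flip sign entirely under a one-pixel shift), while `af_err` is 0.0: the blur annihilates the Nyquist component, so the downsampled output no longer depends on the shift.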
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to their numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy that boosts the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by a few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, enabling elastic adjustment of the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be released.
https://arxiv.org/abs/2603.11220
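The AvgPool/MaxPool frequency decomposition can be sketched loosely: pool token groups two ways, treat the smooth pooled mean as the low-frequency branch and the max-minus-mean residual as the high-frequency branch, then re-weight the branches with learnable scalars. This is a hypothetical numpy analogue under those assumptions (`fmvr_sketch`, `alpha`, `beta` are illustrative names, not the paper's operators):

```python
import numpy as np

def fmvr_sketch(tokens, k=2, alpha=1.0, beta=1.0):
    """Pool groups of k tokens with AvgPool (smooth summary) and MaxPool
    (salient summary); split into a low-frequency part (pooled mean) and a
    high-frequency residual (max - mean); re-weight with scalars alpha/beta
    standing in for the lightweight learnable modulation parameters."""
    n, d = tokens.shape
    groups = tokens.reshape(n // k, k, d)
    avg = groups.mean(axis=1)   # low-frequency branch
    mx = groups.max(axis=1)     # salient / high-frequency branch
    return alpha * avg + beta * (mx - avg)

tokens = np.arange(8, dtype=float).reshape(4, 2)
merged = fmvr_sketch(tokens)    # equals MaxPool when alpha = beta = 1
```

Setting `beta=0` recovers pure AvgPool, `alpha=beta=1` recovers pure MaxPool, and intermediate values interpolate between preserving dominant semantics and restoring diluted ones.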
Ultra-high-definition (UHD) image deblurring poses significant challenges for UHD restoration methods, which must balance fine-grained detail recovery and practical inference efficiency. Although prominent discriminative and generative methods have achieved remarkable results, a trade-off persists between computational cost and the ability to generate fine-grained detail for UHD image deblurring tasks. To further alleviate these issues, we propose a novel autoregressive flow method for UHD image deblurring with an ill-conditioned constraint. Our core idea is to decompose UHD restoration into a progressive, coarse-to-fine process: at each scale, the sharp estimate is formed by upsampling the previous-scale result and adding a current-scale residual, enabling stable, stage-wise refinement from low to high resolution. We further introduce Flow Matching to model residual generation as a conditional vector field and perform few-step ODE sampling with efficient Euler/Heun solvers, enriching details while keeping inference affordable. Since multi-step generation at UHD can be numerically unstable, we propose an ill-conditioning suppression scheme by imposing condition-number regularization on a feature-induced attention matrix, improving convergence and cross-scale consistency. Our method demonstrates promising performance on blurred images at 4K (3840$\times$2160) or higher resolutions.
https://arxiv.org/abs/2603.10517
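The coarse-to-fine decomposition (upsample the previous-scale estimate, then add a residual generated by few-step Euler sampling of a flow) can be illustrated in one dimension. The functions below are hypothetical stand-ins under that reading of the abstract, not the paper's model; the toy constant velocity field replaces the learned conditional vector field.

```python
import numpy as np

def upsample2(x):
    # nearest-neighbour 2x upsampling of a 1-D "image"
    return np.repeat(x, 2)

def euler_sample(v, x0, steps=4):
    """Few-step Euler integration of dx/dt = v(x, t) from t=0 to t=1,
    standing in for the flow-matching residual generator at one scale."""
    x, dt = x0.astype(float), 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

def coarse_to_fine(velocity_fields, coarse_init):
    # per scale: upsample the previous estimate, then refine it with the flow
    est = coarse_init
    for v in velocity_fields:
        est = euler_sample(v, upsample2(est))
    return est

v_unit = lambda x, t: np.ones_like(x)          # toy velocity field
fine = coarse_to_fine([v_unit], np.zeros(2))   # one refinement scale
```

With the unit velocity field, the zero coarse estimate is upsampled to four samples and Euler integration accumulates 4 steps of 0.25, so `fine` is an array of ones; a Heun solver would simply replace the single-slope update with a two-stage average.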
Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization "tug-of-war" between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP.
https://arxiv.org/abs/2603.09624
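For context on the Int8 setting, here is a generic symmetric fake-quantization forward pass of the kind QAT inserts during training (quantize, then dequantize, so the network sees rounding error). This is a textbook sketch only; QDR's actual scheme, DFD, and LMR are not shown, and a real framework would add a straight-through estimator for the backward pass.

```python
import numpy as np

def fake_quant_int8(w, scale=None):
    """Symmetric Int8 fake quantization: map weights to integer levels in
    [-127, 127], then dequantize back to float so training observes the
    quantization error the deployed Int8 model will incur."""
    if scale is None:
        scale = np.abs(w).max() / 127.0   # per-tensor symmetric scale
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

w = np.array([0.0, 0.5, -1.0])
wq = fake_quant_int8(w)
```

The round-trip error is bounded by one quantization step (`scale`), which is what lets an Int8 student recover most of the FP32 teacher's accuracy when that error is accounted for during training.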
Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
https://arxiv.org/abs/2603.09484
Image restoration requires simultaneously preserving fine-grained local structures and maintaining long-range spatial coherence. While convolutional networks struggle with limited receptive fields, and Transformers incur quadratic complexity for global attention, recent State Space Models (SSMs), such as Mamba, provide an appealing linear-time alternative for long-range dependency modelling. However, naively extending Mamba to 2D images exposes two intrinsic shortcomings. First, flattening 2D feature maps into 1D sequences disrupts spatial topology, leading to locality distortion that hampers precise structural recovery. Second, the stability-driven recurrent dynamics of SSMs induce long-range decay, progressively attenuating information across distant spatial positions and weakening global consistency. Together, these effects limit the effectiveness of state-space modelling in high-fidelity restoration. We propose Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework designed to reconcile locality preservation with efficient global propagation. Instead of sequentially flattening entire feature maps, PS-Mamba performs geometry-consistent partitioning, maintaining neighbourhood integrity prior to state-space processing. A progressive split hierarchy (halves, quadrants, octants) enables structured multi-scale modelling while retaining linear complexity. To counteract long-range decay, we introduce symmetric cross-scale shortcut pathways that directly transmit low-frequency global context across hierarchical levels, stabilising information flow over large spatial extents. Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with a clear margin.
https://arxiv.org/abs/2603.09171
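The progressive split hierarchy (halves, quadrants, octants) amounts to recursive, geometry-consistent partitioning of the feature map so that tokens flattened within a block remain spatial neighbours. The sketch below illustrates only that partitioning order; `split_blocks` is a hypothetical helper, and the state-space processing itself is omitted.

```python
import numpy as np

def split_blocks(fmap, splits):
    """Recursively split a (H, W) map: each level halves the longer axis,
    keeping every block spatially contiguous before 1-D flattening."""
    blocks = [fmap]
    for _ in range(splits):
        nxt = []
        for b in blocks:
            h, w = b.shape
            if h >= w:
                nxt += [b[:h // 2], b[h // 2:]]     # split into top/bottom
            else:
                nxt += [b[:, :w // 2], b[:, w // 2:]]  # split into left/right
        blocks = nxt
    return blocks

fm = np.arange(16).reshape(4, 4)
halves = split_blocks(fm, 1)      # 2 blocks of shape (2, 4)
quadrants = split_blocks(fm, 2)   # 4 contiguous (2, 2) blocks
```

Flattening `quadrants[0]` keeps the four upper-left pixels adjacent, whereas raster-flattening the whole map would interleave them with distant columns; this is the locality-distortion problem the partitioning avoids.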
Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at this https URL.
https://arxiv.org/abs/2603.09125
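The UNG idea reduces to scaling injected noise by a per-pixel uncertainty map: strong perturbation where reconstruction is uncertain (edges, textures), little where it is confident (flat areas). A hedged numpy sketch; `uncertainty_guided_noise` and the toy map are illustrative assumptions, not the paper's module.

```python
import numpy as np

rng = np.random.default_rng(0)

def uncertainty_guided_noise(latent, uncertainty, sigma_max=1.0):
    """Modulate per-pixel Gaussian noise by a [0, 1] uncertainty map:
    uncertain regions get perturbed (to let the diffusion model resynthesize
    detail), confident regions stay untouched (to preserve information)."""
    noise = rng.standard_normal(latent.shape)
    return latent + sigma_max * uncertainty * noise

flat = np.zeros((4, 4))
u_map = np.zeros((4, 4))
u_map[:, 2:] = 1.0   # pretend the right half is high-uncertainty texture
out = uncertainty_guided_noise(flat, u_map)
```

The left (zero-uncertainty) half of `out` is exactly the input, while the right half receives full-strength noise, the spatially non-uniform injection the module is built around.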
Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
https://arxiv.org/abs/2603.09054
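The full-product U-Net rests on the convolution theorem: circular convolution in the spatial domain equals an element-wise product in the frequency domain, so an `O(n log n)` FFT product can replace an `O(n^2)` convolution. A minimal 1-D numpy check of that identity (not the authors' layer):

```python
import numpy as np

def circular_conv(x, k):
    # direct circular convolution, O(n^2)
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(n))
                     for i in range(n)])

def product_layer(x, k):
    # convolution theorem: element-wise product in the frequency domain
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, 0.0, 1.0])   # toy two-tap kernel
```

`circular_conv(x, k)` and `product_layer(x, k)` agree to floating-point precision, which is the equivalence that lets the architecture swap convolutions for cheaper element-wise product layers.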
Novel view synthesis from low-light, noisy, and motion-blurred imagery remains a valuable and challenging task. Current volumetric rendering methods struggle with compound degradation, and sequential 2D preprocessing introduces artifacts due to interdependencies. In this work, we introduce FLED-GS, a fast low-light enhancement and deblurring framework that reformulates 3D scene restoration as an alternating cycle of enhancement and reconstruction. Specifically, FLED-GS inserts several intermediate brightness anchors to enable progressive recovery, preventing noise blow-up from harming deblurring or geometry. Each iteration sharpens inputs with an off-the-shelf 2D deblurrer and then performs noise-aware 3DGS reconstruction that estimates and suppresses noise while producing clean priors for the next level. Experiments show FLED-GS outperforms state-of-the-art LuSh-NeRF, achieving 21$\times$ faster training and 11$\times$ faster rendering.
https://arxiv.org/abs/2603.08133
Real-world image restoration (RWIR) is a highly challenging task due to the absence of clean ground-truth images. Many recent methods resort to pseudo-label (PL) supervision, often within a Mean-Teacher (MT) framework. However, these methods face a critical paradox: unconditionally trusting the often imperfect, low-quality PLs forces the student model to learn undesirable artifacts, while discarding them severely limits data diversity and impairs model generalization. In this paper, we propose QualiTeacher, a novel framework that transforms pseudo-label quality from a noisy liability into a conditional supervisory signal. Instead of filtering, QualiTeacher explicitly conditions the student model on the quality of the PLs, estimated by an ensemble of complementary non-reference image quality assessment (NR-IQA) models spanning low-level distortion and semantic-level assessment. This strategy teaches the student network to learn a quality-graded restoration manifold, enabling it to understand what constitutes different quality levels. Consequently, it can not only avoid mimicking artifacts from low-quality labels but also extrapolate to generate results of higher quality than the teacher itself. To ensure the robustness and accuracy of this quality-driven learning, we further enhance the process with a multi-augmentation scheme to diversify the PL quality spectrum, a score-based preference optimization strategy inspired by Direct Preference Optimization (DPO) to enforce a monotonically ordered quality separation, and a cropped consistency loss to prevent adversarial over-optimization (reward hacking) of the IQA models. Experiments on standard RWIR benchmarks demonstrate that QualiTeacher can serve as a plug-and-play strategy to improve the quality of the existing pseudo-labeling framework, establishing a new paradigm for learning from imperfect supervision. Code will be released.
https://arxiv.org/abs/2603.08030
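The DPO-inspired preference step can be sketched as a pairwise logistic loss that rewards the student for scoring IQA-preferred restorations higher, enforcing a monotonic quality ordering. `preference_loss` is a hypothetical illustration under that assumption, not the paper's objective.

```python
import numpy as np

def preference_loss(model_scores, iqa_scores, beta=1.0):
    """For every pair where the IQA ensemble prefers i over j, penalise
    -log sigmoid(beta * (s_i - s_j)), DPO-style: the loss shrinks only when
    the model's scores reproduce the IQA ordering with a positive margin."""
    loss, pairs = 0.0, 0
    n = len(model_scores)
    for i in range(n):
        for j in range(n):
            if iqa_scores[i] > iqa_scores[j]:
                margin = beta * (model_scores[i] - model_scores[j])
                loss += np.log1p(np.exp(-margin))   # -log sigmoid(margin)
                pairs += 1
    return loss / max(pairs, 1)

iqa = [0.2, 0.5, 0.9]                         # toy NR-IQA quality ratings
ordered = preference_loss([0.0, 2.0, 4.0], iqa)   # agrees with IQA ranking
shuffled = preference_loss([4.0, 2.0, 0.0], iqa)  # inverts the ranking
```

`ordered` is far smaller than `shuffled`, so gradient descent on this loss pushes the model toward the quality-graded ordering the NR-IQA ensemble supplies.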
Robotic systems operating in real-world environments inevitably encounter unobserved dynamics shifts during continuous execution, including changes in actuation, mass distribution, or contact conditions. When such shifts occur mid-episode, even locally stabilizing learned policies can experience substantial transient performance degradation. While input-to-state stability guarantees bounded state deviation, it does not ensure rapid restoration of task-level performance. We address inference-time recovery under frozen policy parameters by casting adaptation as constrained disturbance shaping around a nominal stabilizing controller. We propose a stability-aligned residual control architecture in which a reinforcement learning policy trained under nominal dynamics remains fixed at deployment, and adaptation occurs exclusively through a bounded additive residual channel. A Stability Alignment Gate (SAG) regulates corrective authority through magnitude constraints, directional coherence with the nominal action, performance-conditioned activation, and adaptive gain modulation. These mechanisms preserve the nominal closed-loop structure while enabling rapid compensation for unobserved dynamics shifts without retraining or privileged disturbance information. Across mid-episode perturbations including actuator degradation, mass variation, and contact changes, the proposed method consistently reduces recovery time relative to frozen and online-adaptation baselines while maintaining near-nominal steady-state performance. Recovery time is reduced by \textbf{87\%} on the Go1 quadruped, \textbf{48\%} on the Cassie biped, \textbf{30\%} on the H1 humanoid, and \textbf{20\%} on the Scout wheeled platform on average across evaluated conditions relative to a frozen SAC policy.
https://arxiv.org/abs/2603.07775
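The Stability Alignment Gate's mechanisms (magnitude constraint, directional coherence with the nominal action, activation, gain modulation) can be sketched for a 2-D action vector. `sag_gate` below is a hypothetical illustration under those stated constraints, not the authors' controller.

```python
import numpy as np

def sag_gate(nominal, residual, max_norm=0.3, min_cos=0.0,
             active=True, gain=1.0):
    """Gate a learned corrective residual before adding it to the nominal
    action: zero it when it opposes the nominal direction (coherence check)
    or when performance monitoring deactivates it, clip its magnitude, and
    scale it by an adaptive gain."""
    if not active:                 # performance-conditioned activation
        return nominal
    denom = np.linalg.norm(nominal) * np.linalg.norm(residual) + 1e-8
    cos = float(nominal @ residual) / denom
    if cos < min_cos:              # directional coherence constraint
        residual = np.zeros_like(residual)
    norm = np.linalg.norm(residual)
    if norm > max_norm:            # magnitude constraint
        residual = residual * (max_norm / norm)
    return nominal + gain * residual   # bounded additive residual channel

a = np.array([1.0, 0.0])
aligned = sag_gate(a, np.array([0.5, 0.0]))    # clipped to norm 0.3
opposed = sag_gate(a, np.array([-0.5, 0.0]))   # zeroed by coherence check
```

Because the residual is bounded and never opposes the nominal action, the nominal closed-loop structure is preserved while the residual channel supplies fast corrective authority.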