Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, these distortions are highly unpredictable and vary significantly across real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR inputs remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images depend strongly on the underlying SR algorithms, rather than being determined solely by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains SR-model-oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To support unsupervised pretext training, we constructed a new dataset, SRMORSS, which applies a wide range of SR algorithms to numerous real LR images and addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3RIQA consistently outperforms relevant state-of-the-art metrics.
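As a rough illustration of this pretext objective, the sketch below implements a supervised-contrastive loss whose "classes" are SR model identities rather than image content; the function name and hyperparameters are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def sr_model_contrastive_loss(embeddings, sr_model_ids, temperature=0.1):
    """Contrastive pretext loss: crops super-resolved by the same SR model
    are positives, crops from different SR models are negatives, regardless
    of image content (illustrative sketch, not the paper's exact loss)."""
    z = F.normalize(embeddings, dim=1)                 # (B, D)
    sim = z @ z.t() / temperature                      # pairwise similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))          # drop self-similarity
    pos = (sr_model_ids.unsqueeze(0) == sr_model_ids.unsqueeze(1)) & ~eye
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos).sum(dim=1) / pos_counts   # mean log-lik of positives
    return loss[pos.sum(dim=1) > 0].mean()
```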
https://arxiv.org/abs/2602.10744
Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit with slow inference and high device demands. To accelerate inference, recent works like GenDR adopt step distillation to reduce the number of sampling steps to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint the variational auto-encoder (VAE) as the bottleneck for both latency and memory. To solve the problem completely, we leverage pixel-(un)shuffle operations to eliminate the VAE, converting the latent-space GenDR into the pixel-space GenDR-Pix. However, upscaling with x8 pixel-shuffle may induce repeated-pattern artifacts. To alleviate this distortion, we propose a multi-stage adversarial distillation that progressively removes the encoder and decoder. Specifically, we utilize generative features from the previous-stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier-space loss to penalize amplitude outliers. To improve inference scaling, we empirically integrate a padding-based self-ensemble with classifier-free guidance. Experimental results show that GenDR-Pix achieves 2.8x acceleration and 60% memory savings compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR methods. Notably, GenDR-Pix can restore a 4K image in only 1 second and 6 GB of memory.
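The VAE-elimination idea is concrete enough to sketch with PyTorch's built-in pixel-(un)shuffle layers: `PixelUnshuffle` packs spatial pixels into channels so the one-step backbone runs at 1/8 spatial size, and `PixelShuffle` unpacks the output, with no learned encoder or decoder. Shapes and the stand-in backbone below are assumptions, not GenDR-Pix itself.

```python
import torch
import torch.nn as nn

class PixelSpaceSR(nn.Module):
    """VAE-free restoration wrapper (illustrative): pixel-unshuffle replaces
    the encoder, pixel-shuffle replaces the decoder."""
    def __init__(self, backbone, factor=8):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(factor)  # (B,3,H,W)->(B,3*f*f,H/f,W/f)
        self.shuffle = nn.PixelShuffle(factor)      # exact inverse mapping
        self.backbone = backbone                    # stand-in for the one-step model

    def forward(self, lr_upsampled):
        x = self.unshuffle(lr_upsampled)  # pack pixels: no learned encoder
        x = self.backbone(x)              # restore in the packed space
        return self.shuffle(x)            # unpack pixels: no learned decoder

# e.g. PixelSpaceSR(nn.Conv2d(3 * 64, 3 * 64, 3, padding=1))  # toy backbone
```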
https://arxiv.org/abs/2602.10630
We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to an isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling the generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS's superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.
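A minimal sketch of the adaptive entry-point idea: map estimated refinement uncertainty to a starting timestep, then noise the intermediate estimate according to the standard forward process q(x_t | x_0). The linear uncertainty-to-timestep mapping is an assumed stand-in for the paper's selection rule.

```python
import torch

def adaptive_start(x_refined, uncertainty, alphas_cumprod):
    """Choose where to enter the reverse diffusion trajectory and inject
    matched noise (illustrative sketch of the AdaDS idea)."""
    T = len(alphas_cumprod)
    t = int(uncertainty.clamp(0, 1).item() * (T - 1))  # more uncertainty -> later t
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x_refined)
    x_t = a_bar.sqrt() * x_refined + (1 - a_bar).sqrt() * noise  # q(x_t | x_0)
    return x_t, t  # run the pre-trained reverse process from step t
```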
https://arxiv.org/abs/2602.09510
Data-driven super-resolution (SR) methods are often integrated into imaging pipelines as preprocessing steps to improve downstream tasks such as classification and detection. However, these SR models introduce a previously unexplored attack surface into imaging pipelines. In this paper, we present AdvSR, a framework demonstrating that adversarial behavior can be embedded directly into SR model weights during training, requiring no access to inputs at inference time. Unlike prior attacks that perturb inputs or rely on backdoor triggers, AdvSR operates entirely at the model level. By jointly optimizing for reconstruction quality and targeted adversarial outcomes, AdvSR produces models that appear benign under standard image quality metrics while inducing downstream misclassification. We evaluate AdvSR on three SR architectures (SRCNN, EDSR, SwinIR) paired with a YOLOv11 classifier and demonstrate that AdvSR models can achieve high attack success rates with minimal quality degradation. These findings highlight a new model-level threat for imaging pipelines, with implications for how practitioners source and validate models in safety-critical applications.
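The threat model is easiest to see as a training objective. The sketch below jointly optimizes reconstruction fidelity against the HR target and a targeted loss against a frozen downstream classifier; the weighting and interfaces are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def advsr_loss(sr_model, classifier, lr, hr, target_class, lam=0.05):
    """Joint objective (illustrative): the SR model stays visually faithful
    while steering a frozen victim classifier toward an attacker-chosen class."""
    sr = sr_model(lr)
    recon = F.l1_loss(sr, hr)                  # keeps standard IQA metrics benign
    logits = classifier(sr)                    # frozen downstream victim
    target = torch.full((sr.size(0),), target_class,
                        dtype=torch.long, device=sr.device)
    attack = F.cross_entropy(logits, target)   # targeted misclassification
    return recon + lam * attack                # lam trades stealth vs. attack success
```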
https://arxiv.org/abs/2602.07251
Diffusion-based super-resolution can synthesize rich details, but models trained on synthetic paired data often fail on real-world LR images due to distribution shifts. We propose Bird-SR, a bidirectional reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images. Because structural fidelity is easily compromised in ReFL, the model is directly optimized on synthetic pairs at early diffusion steps; this also aids structure preservation for real-world inputs, where the distribution gap at the structural level is smaller. For perceptual enhancement, quality-guided rewards are applied at later sampling steps to both synthetic and real LR images. To mitigate reward hacking, the rewards for synthetic results are formulated in a relative advantage space bounded by their clean counterparts, while real-world optimization is regularized via a semantic alignment constraint. Furthermore, to balance structural and perceptual learning, we adopt a dynamic fidelity-perception weighting strategy that emphasizes structure preservation at early stages and progressively shifts focus toward perceptual optimization at later diffusion steps. Extensive experiments on real-world SR benchmarks demonstrate that Bird-SR consistently outperforms state-of-the-art methods in perceptual quality while preserving structural consistency, validating its effectiveness for real-world super-resolution.
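The dynamic fidelity-perception weighting can be sketched as a timestep-dependent schedule; the sigmoid form below is an assumed stand-in for whatever schedule the paper actually uses.

```python
import math

def fidelity_perception_weights(t, T, sharpness=6.0):
    """Early steps (large t, high noise) weight structural fidelity; later
    steps shift toward perceptual rewards (illustrative schedule)."""
    progress = 1.0 - t / T                    # 0 at the noisiest step, 1 at the last
    w_percept = 1.0 / (1.0 + math.exp(-sharpness * (progress - 0.5)))
    return 1.0 - w_percept, w_percept         # (w_fidelity, w_percept)

# per step: total = w_f * structural_loss + w_p * reward_loss
```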
https://arxiv.org/abs/2602.07069
Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.
https://arxiv.org/abs/2602.06122
While deep learning-based super-resolution (SR) methods have shown impressive outcomes with synthetic degradation scenarios such as bicubic downsampling, they frequently struggle to perform well on real-world images that feature complex, nonlinear degradations like noise, blur, and compression artifacts. Recent efforts to address this issue have involved the painstaking compilation of real low-resolution (LR) and high-resolution (HR) image pairs, usually limited to several specific downscaling factors. To address these challenges, our work introduces a novel framework capable of synthesizing authentic LR images from a single HR image by leveraging the latent degradation space with flow matching. Our approach generates LR images with realistic artifacts at unseen degradation levels, which facilitates the creation of large-scale, real-world SR training datasets. Comprehensive quantitative and qualitative assessments verify that our synthetic LR images accurately replicate real-world degradations. Furthermore, both traditional and arbitrary-scale SR models trained using our datasets consistently yield substantially better HR results.
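For reference, a minimal conditional flow-matching objective of the kind such a latent degradation generator could be trained with (rectified-flow form; the network and conditioning interfaces are assumptions):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, z_degraded, hr_cond):
    """Regress the straight-line velocity between a Gaussian sample and a
    latent degradation code, conditioned on the HR image (illustrative)."""
    b = z_degraded.size(0)
    z0 = torch.randn_like(z_degraded)                    # prior sample
    t = torch.rand(b, device=z_degraded.device).view(b, 1, 1, 1)
    z_t = (1 - t) * z0 + t * z_degraded                  # linear interpolation path
    v_target = z_degraded - z0                           # constant target velocity
    v_pred = velocity_net(z_t, t.flatten(), hr_cond)     # assumed interface
    return F.mse_loss(v_pred, v_target)
```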
https://arxiv.org/abs/2602.04193
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at this https URL
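Contribution (1), inference-time vector-field fusion, can be sketched as a convex combination of the two decoders' predicted velocities inside each flow-matching ODE step; the fusion weight and interfaces are assumptions.

```python
import torch

@torch.no_grad()
def fused_ode_step(x, t, dt, v_duration, v_alignfree, cond, w=0.5):
    """One Euler step where the velocity mixes a duration-guided decoder
    (stability) with an alignment-free decoder (naturalness). Illustrative."""
    v = w * v_duration(x, t, cond) + (1 - w) * v_alignfree(x, t, cond)
    return x + dt * v
```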
https://arxiv.org/abs/2602.04160
Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, but modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions, where a single global caption causes prompt underspecification. A coarse global prompt often misses localized details (prompt sparsity) and provides locally irrelevant guidance (prompt misguidance) that can be amplified by classifier-free guidance. We propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance that resolves prompt underspecification with minimal overhead. Experiments on high resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts relative to global-prompt baselines.
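A minimal sketch of the per-tile prompting loop, with the tile captioner, per-tile denoiser, and overlap blending left abstract (all interfaces are assumptions):

```python
def denoise_with_tiled_prompts(latent, tiles, caption_tile, denoise_tile,
                               global_prompt):
    """Per-tile text conditioning (illustrative): each latent tile gets its
    own prompt instead of inheriting one underspecified global caption."""
    outputs = []
    for (y0, y1, x0, x1) in tiles:                   # tile coords in latent space
        tile = latent[:, :, y0:y1, x0:x1]
        local = caption_tile(tile)                   # localized semantic prior
        prompt = f"{global_prompt}, {local}"         # keep global context too
        outputs.append(((y0, y1, x0, x1), denoise_tile(tile, prompt)))
    return outputs   # blend back with the pipeline's usual tile-overlap scheme
```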
https://arxiv.org/abs/2602.03342
One-step diffusion models have demonstrated promising capability and fast inference in real-world video super-resolution (VSR). Nevertheless, the substantial model size and high computational cost of Diffusion Transformers (DiTs) limit downstream applications. While low-bit quantization is a common approach for model compression, the effectiveness of quantized models is challenged by the high dynamic range of input latents and diverse layer behaviors. To address these challenges, we introduce LSGQuant, a layer-sensitivity-guided quantization approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Furthermore, we estimate layer sensitivity by analyzing layer-wise statistics during calibration and implement a Variance-Oriented Layer Training Strategy (VOLTS). We also introduce Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch. Extensive experiments demonstrate that our method performs nearly on par with the original full-precision model and significantly exceeds existing quantization techniques. Code is available at: this https URL.
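The DRAQ component can be approximated by per-token dynamic-range scaling; the fake-quantization sketch below is illustrative, not the paper's exact quantizer.

```python
import torch

def draq_quantize(x, n_bits=8):
    """Per-token fake quantization: each token's scale comes from its own
    absolute maximum, so outlier tokens in the video latent do not stretch
    the grid for everyone else (illustrative)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale   # dequantized activations for quantization simulation
```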
https://arxiv.org/abs/2602.03182
Recent electroencephalography (EEG) spatial super-resolution (SR) methods improve quality by either directly predicting missing signals from visible channels or adapting latent diffusion-based generative modeling to temporal data, but they often lack awareness of physiological spatial structure, which constrains spatial generation performance. To address this issue, we introduce TopoDiff, a geometry- and relation-aware diffusion model for EEG spatial super-resolution. Inspired by how human experts interpret spatial EEG patterns, TopoDiff incorporates topology-aware image embeddings derived from EEG topographic representations to provide global geometric context for spatial generation, together with a dynamic channel-relation graph that encodes inter-electrode relationships and evolves with temporal dynamics. This design yields a spatially grounded EEG spatial super-resolution framework with consistent performance improvements. Across multiple EEG datasets spanning diverse applications, including SEED/SEED-IV for emotion recognition, PhysioNet motor imagery (MI/MM), and TUSZ for seizure detection, our method achieves substantial gains in generation fidelity and leads to notable improvements in downstream EEG task performance.
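One plausible reading of the channel-relation graph is a geometric adjacency built from electrode coordinates, blended per time window with signal correlation so the graph evolves with temporal dynamics. The sketch below is an assumption-laden illustration, not TopoDiff's actual construction.

```python
import numpy as np

def electrode_adjacency(positions, sigma=0.05):
    """Geometric prior: Gaussian kernel over 3D electrode coordinates (C, 3)."""
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

def dynamic_adjacency(A_geom, eeg_window, alpha=0.5):
    """Blend geometry with inter-channel correlation of a window (C, T)."""
    corr = np.abs(np.corrcoef(eeg_window))
    np.fill_diagonal(corr, 0.0)
    return alpha * A_geom + (1 - alpha) * corr
```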
https://arxiv.org/abs/2602.02238
Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a "Trust but Verify" principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment.
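A rough sketch of what AICG-style gating could look like: learnable summary tokens attend over reference features to distill dominant patterns, their correlation with LQ tokens forms the implicit evidence, and a sigmoid gate scales the injected guidance. Module choices and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitCorrelationGate(nn.Module):
    """'Trust but Verify' gating sketch (illustrative, not Ada-RefSR's code)."""
    def __init__(self, dim, n_summary=8):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(n_summary, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, lq_tokens, ref_tokens):             # (B, N, D), (B, M, D)
        s = self.summary.unsqueeze(0).expand(ref_tokens.size(0), -1, -1)
        summary, _ = self.attn(s, ref_tokens, ref_tokens)     # distill reference
        corr = torch.einsum('bnd,bmd->bnm', lq_tokens, summary).softmax(-1)
        evidence = corr @ summary                  # implicit LQ-Ref correlation
        g = self.gate(evidence)                    # per-token trust in [0, 1]
        return lq_tokens + g * evidence            # gated reference injection
```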
https://arxiv.org/abs/2602.01864
Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at this https URL.
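H-SVD is concrete enough to illustrate: a global truncated SVD captures shared low-rank structure, and a rank-1 SVD per block of the residual recovers local detail under a small parameter budget. The ranks and block size below are illustrative.

```python
import torch

def h_svd(W, global_rank=32, block=64):
    """Hierarchical SVD sketch: global low-rank branch + block-wise rank-1
    branch on the residual (illustrative parameters)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_global = (U[:, :global_rank] * S[:global_rank]) @ Vh[:global_rank]
    R = W - W_global
    W_local = torch.zeros_like(W)
    for i in range(0, W.shape[0], block):
        for j in range(0, W.shape[1], block):
            u, s, vh = torch.linalg.svd(R[i:i+block, j:j+block],
                                        full_matrices=False)
            W_local[i:i+block, j:j+block] = s[0] * torch.outer(u[:, 0], vh[0])
    return W_global, W_local   # the remaining residual is what gets quantized
```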
https://arxiv.org/abs/2602.01273
Neuropathological analyses benefit from spatially precise volumetric reconstructions that enhance anatomical delineation and improve morphometric accuracy. Our prior work has shown the feasibility of reconstructing 3D brain volumes from 2D dissection photographs. However, these outputs sometimes exhibit coarse, overly smooth reconstructions of structures, especially under high anisotropy (i.e., reconstructions from thick slabs). Here, we introduce a computationally efficient super-resolution step that imputes slices to generate anatomically consistent isotropic volumes from anisotropic 3D reconstructions of dissection photographs. By training on domain-randomized synthetic data, we ensure that our method generalizes across dissection protocols and remains robust to large slab thicknesses. The imputed volumes yield improved automated segmentations, achieving higher Dice scores, particularly in cortical and white matter regions. Validation on surface reconstruction and atlas registration tasks demonstrates more accurate cortical surfaces and MRI registration. By enhancing the resolution and anatomical fidelity of photograph-based reconstructions, our approach strengthens the bridge between neuropathology and neuroimaging. Our method is publicly available at this https URL
https://arxiv.org/abs/2602.00669
Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model's capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are provided in this https URL.
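GLCM texture features are standard, so MAS-GLCM can be sketched with `skimage`: co-occurrence matrices over several pixel offsets (scales) and angles, summarized by common texture statistics. The property set and scales are assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def mas_glcm_features(img_u8, scales=(1, 2, 4),
                      angles=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Multi-angle, multi-scale GLCM descriptor (illustrative): a compact,
    fine-grained cue for degradation type and level. Expects uint8 grayscale."""
    glcm = graycomatrix(img_u8, distances=list(scales), angles=list(angles),
                        levels=256, symmetric=True, normed=True)
    feats = [graycoprops(glcm, p).ravel()
             for p in ('contrast', 'homogeneity', 'energy', 'correlation')]
    return np.concatenate(feats)   # one value per (scale, angle, property)
```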
https://arxiv.org/abs/2602.00579
Diffusion models have been increasingly used as strong generative priors for solving inverse problems such as super-resolution in medical imaging. However, these approaches typically utilize a diffusion prior trained at a single scale, ignoring the hierarchical scale structure of image data. In this work, we propose to decompose images into Laplacian pyramid scales and train separate diffusion priors for each frequency band. We then develop an algorithm to perform super-resolution that utilizes these priors to progressively refine reconstructions across different scales. Evaluated on brain, knee, and prostate MRI data, our approach both improves perceptual quality over baselines and reduces inference time through smaller coarse-scale networks. Our framework unifies multiscale reconstruction and diffusion priors for medical image super-resolution.
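A minimal Laplacian pyramid decomposition of the kind the per-band priors would be trained on (bilinear resampling below stands in for the usual Gaussian blur-and-decimate; reconstruction is exact either way):

```python
import torch.nn.functional as F

def build_laplacian_pyramid(x, levels=3):
    """Split a batch (B, C, H, W) into band-pass details plus a low-pass base."""
    pyramid = []
    for _ in range(levels - 1):
        down = F.avg_pool2d(x, 2)
        up = F.interpolate(down, size=x.shape[-2:], mode='bilinear',
                           align_corners=False)
        pyramid.append(x - up)        # band-pass detail at this scale
        x = down
    pyramid.append(x)                 # coarsest low-pass residual
    return pyramid

def reconstruct(pyramid):
    x = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        x = F.interpolate(x, size=band.shape[-2:], mode='bilinear',
                          align_corners=False) + band
    return x
```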
https://arxiv.org/abs/2601.23201
Hyperspectral single image super-resolution (SISR) aims to enhance spatial resolution while preserving the rich spectral information of hyperspectral images. Most existing methods rely on supervised learning with high-resolution ground truth data, which is often unavailable in practice. To overcome this limitation, we propose an unsupervised learning approach based on synthetic abundance data. The hyperspectral image is first decomposed into endmembers and abundance maps through hyperspectral unmixing. A neural network is then trained to super-resolve these maps using data generated with the dead leaves model, which replicates the statistical properties of real abundances. The final super-resolution hyperspectral image is reconstructed by recombining the super-resolved abundance maps with the endmembers. Experimental results demonstrate the effectiveness of our method and the relevance of synthetic data for training.
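The dead leaves model itself is easy to sketch: occluding disks with power-law-distributed radii, painted front-to-back so earlier "leaves" occlude later ones. Parameters below are illustrative, not the paper's.

```python
import numpy as np

def dead_leaves(size=256, n_disks=2000, r_min=2, r_max=64, seed=0):
    """Single-channel dead leaves image, a stand-in for one abundance map."""
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size), dtype=np.float32)
    filled = np.zeros((size, size), dtype=bool)
    yy, xx = np.mgrid[:size, :size]
    for _ in range(n_disks):
        r = r_min * (r_max / r_min) ** rng.random()   # ~ power-law radii
        cy, cx = rng.integers(0, size, 2)
        disk = (yy - cy) ** 2 + (xx - cx) ** 2 <= r * r
        new = disk & ~filled                          # occlusion: paint only uncovered
        img[new] = rng.random()                       # constant 'leaf' intensity
        filled |= disk
        if filled.all():
            break
    return img
```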
https://arxiv.org/abs/2602.02552
Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance, but it often fails to recover fine details because measurement terms are applied in a manner that is weakly coupled to the diffusion noise level. At high noise, data-consistency gradients computed from inaccurate estimates can be geometrically incongruent with the posterior geometry, inducing early-step drift, spurious high-frequency artifacts, and sensitivity to schedules and ill-conditioned operators. To address these concerns, we propose a noise-frequency continuation framework that constructs a continuous family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band. This principle is instantiated with a stabilized posterior sampler that combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy that aggressively commits reliable coarse corrections while conservatively adopting high-frequency details only when they become identifiable. Across super-resolution, inpainting, and deblurring, our method achieves state-of-the-art performance and improves motion deblurring PSNR by up to 5 dB over strong baselines.
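The band-limited likelihood guidance can be sketched as a noise-dependent low-pass filter on the data-consistency gradient in Fourier space; the linear cutoff schedule below is an assumed stand-in for the paper's continuation schedule.

```python
import torch

def band_limited_guidance(grad, t, T, h, w):
    """Keep only low frequencies of the guidance gradient at high noise,
    widening the admitted band as t -> 0 (illustrative)."""
    fy = torch.fft.fftfreq(h, device=grad.device).view(-1, 1)
    fx = torch.fft.rfftfreq(w, device=grad.device).view(1, -1)
    radius = (fy ** 2 + fx ** 2).sqrt()            # normalized frequency radius
    cutoff = 0.5 * (1.0 - t / T) + 1e-3            # grows as noise decreases
    mask = (radius <= cutoff).to(grad.dtype)
    G = torch.fft.rfft2(grad)                      # (..., h, w // 2 + 1)
    return torch.fft.irfft2(G * mask, s=(h, w))
```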
https://arxiv.org/abs/2602.00176
Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at this http URL to support community research.
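The Sparse Metric Prompt is simple to sketch: randomly retain a small fraction of valid metric depth pixels as conditioning and zero out the rest; the keep ratio is an assumed hyperparameter.

```python
import torch

def sparse_metric_prompt(depth, keep_ratio=0.01):
    """Randomly masked depth as a universal metric interface (illustrative)."""
    valid = depth > 0                                  # sensor-valid pixels only
    keep = (torch.rand_like(depth) < keep_ratio) & valid
    prompt = torch.where(keep, depth, torch.zeros_like(depth))
    return prompt, keep   # feed (prompt, mask) alongside the RGB image
```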
https://arxiv.org/abs/2601.22054
Hyperspectral image super-resolution (HSI-SR) aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have achieved great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be enforced through architectural design, doing so sacrifices generality and flexibility. To address this issue, we propose a lightweight, plug-and-play rectifier, the physics-prior-guided Spectral Rectification Super-Resolution Network (SR$^{2}$-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR$^{2}$-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S$^{3}$A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low-resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.
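The degradation-consistency loss can be written down directly, with bicubic downsampling standing in for the true degradation operator:

```python
import torch.nn.functional as F

def degradation_consistency_loss(sr, lr, scale=4):
    """Re-degrading the SR output should reproduce the observed LR input
    (bicubic stand-in for the actual degradation; illustrative)."""
    sr_down = F.interpolate(sr, scale_factor=1.0 / scale,
                            mode='bicubic', align_corners=False)
    return F.l1_loss(sr_down, lr)
```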
https://arxiv.org/abs/2601.21338