Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve only incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves the quality and flexibility of speech generation. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at this https URL. The code and model will be made publicly available after the paper has been accepted.
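As a sketch of the two optimization constraints, the following minimal PyTorch loss pairs CTC-based ASR supervision on the semantic stream with L1 mel restoration on the acoustic stream; the function signature, shapes, and equal weighting are illustrative assumptions, not the paper's exact recipe:

import torch.nn.functional as F

def disentangled_tokenizer_loss(sem_logits, sem_lens, text_targets, text_lens,
                                mel_pred, mel_true, w_asr=1.0, w_rec=1.0):
    # CTC supervision from transcripts pulls linguistic content into the
    # semantic token stream (and tolerates the length mismatch with text).
    log_probs = sem_logits.log_softmax(-1).transpose(0, 1)  # (T, B, vocab)
    asr_loss = F.ctc_loss(log_probs, text_targets, sem_lens, text_lens)
    # Mel-spectrogram restoration pulls style into the acoustic token stream.
    rec_loss = F.l1_loss(mel_pred, mel_true)
    return w_asr * asr_loss + w_rec * rec_loss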
https://arxiv.org/abs/2601.09239
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model. Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.
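A minimal sketch of the slot-query idea follows; the slot count, dimensions, and the exact form of the residual modulation are our assumptions rather than the paper's implementation:

import torch
import torch.nn as nn

class CharSlotModule(nn.Module):
    # Learnable per-position queries retrieve character evidence from the
    # visual tokens; the evidence is folded back in residually so the LLM
    # decodes with explicit structural priors.
    def __init__(self, num_slots=8, dim=1024, heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vis_tokens):  # vis_tokens: (B, N, dim)
        q = self.slots.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        char_feats, attn_w = self.attn(q, vis_tokens, vis_tokens)  # (B, S, dim), (B, S, N)
        # Residual modulation: scatter slot evidence back onto the visual tokens.
        vis_tokens = vis_tokens + attn_w.transpose(1, 2) @ self.proj(char_feats)
        return vis_tokens, char_feats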
https://arxiv.org/abs/2601.09116
Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
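Because each expression has a unique canonical order, scoring reduces to exact string matching; a minimal sketch (the helper names are ours):

def is_valid_permutation(pred: str, scrambled: str) -> bool:
    # The answer must reuse exactly the characters it was given.
    return sorted(pred) == sorted(scrambled)

def exact_match(pred: str, canonical: str) -> bool:
    # A restoration scores only if it reproduces the unique canonical order.
    return pred.strip() == canonical

# e.g., the scrambled input "山人海人" has the unique canonical order "人山人海"
assert is_valid_permutation("人山人海", "山人海人")
assert exact_match("人山人海", "人山人海")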
https://arxiv.org/abs/2601.08626
We introduce T3 (Testing Trustworthy Thinking), a diagnostic benchmark designed to rigorously evaluate LLM causal judgment across Pearl's Ladder of Causality. Comprising 454 expert-curated vignettes, T3 prioritizes high-resolution failure analysis, decomposing performance into Utility (sensitivity), Safety (specificity), and Wise Refusal on underdetermined cases. By applying T3 to frontier models, we diagnose two distinct pathologies: a "Skepticism Trap" at L1 (where safety-tuned models like Claude Haiku reject 60% of valid links) and a non-monotonic Scaling Paradox at L3. In the latter, the larger GPT-5.2 underperforms GPT-4-Turbo by 55 points on ambiguous counterfactuals, driven by a collapse into paralysis (excessive hedging) rather than hallucination. Finally, we use the benchmark to validate a process-verified protocol (RCA), showing that T3 successfully captures the restoration of decisive causal judgment under structured verification.
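Our reading of the Utility/Safety/Wise-Refusal decomposition, as a small scoring function; the record fields and the "abstain" label are assumed names:

def t3_scores(records):
    # records: [{"label": "causal" | "non-causal" | "underdetermined",
    #            "pred":  "causal" | "non-causal" | "abstain"}, ...]
    def rate(preds, want):
        return sum(p == want for p in preds) / max(len(preds), 1)
    utility = rate([r["pred"] for r in records if r["label"] == "causal"], "causal")
    safety = rate([r["pred"] for r in records if r["label"] == "non-causal"], "non-causal")
    wise_refusal = rate([r["pred"] for r in records
                         if r["label"] == "underdetermined"], "abstain")
    return {"utility": utility, "safety": safety, "wise_refusal": wise_refusal}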
https://arxiv.org/abs/2601.08258
Chinese paleography, the study of ancient Chinese writing, is undergoing a computational turn powered by artificial intelligence. This position paper charts the trajectory of this emerging field, arguing that it is evolving from automating isolated visual tasks to creating integrated digital ecosystems for scholarly research. We first map the landscape of digital resources, analyzing critical datasets for oracle bone, bronze, and bamboo slip scripts. The core of our analysis follows the field's methodological pipeline: from foundational visual processing (image restoration, character recognition), through contextual analysis (artifact rejoining, dating), to the advanced reasoning required for automated decipherment and human-AI collaboration. We examine the technological shift from classical computer vision to modern deep learning paradigms, including transformers and large multimodal models. Finally, we synthesize the field's core challenges -- notably data scarcity and a disconnect between current AI capabilities and the holistic nature of humanistic inquiry -- and advocate for a future research agenda focused on creating multimodal, few-shot, and human-centric systems to augment scholarly expertise.
https://arxiv.org/abs/2601.06753
We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.
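A hedged sketch of the depth-aware alpha adjustment as we read it: the opacity of near-camera Gaussians, where medium-induced primitives tend to form, is suppressed early in training, and the gate relaxes as training proceeds (the normalization and schedule are assumptions):

import torch

def depth_aware_alpha(alpha, gauss_xyz, cam_pos, cam_dir, warmup_frac):
    # Signed depth of each Gaussian along the viewing direction.
    z = ((gauss_xyz - cam_pos) * cam_dir).sum(-1).clamp(min=0.0)
    z = z / z.max().clamp(min=1e-6)  # normalize to [0, 1]
    # With warmup_frac near 0, near-camera Gaussians are strongly suppressed;
    # as warmup_frac approaches 1 the adjustment fades out.
    gate = z + (1.0 - z) * warmup_frac
    return alpha * gate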
https://arxiv.org/abs/2601.04984
With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems across domains. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify this instability as a discrepancy between the solver's and the true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies LDM-based inverse solvers through measurement-consistent Langevin updates. Unlike prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
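The corrector can be pictured as a Langevin step whose drift combines the prior score with the gradient of a measurement log-likelihood taken through the decoder; the step size, noise weighting, and names below are illustrative assumptions:

import torch

def mclc_step(z, y, forward_op, decoder, score_fn, noise_level,
              step=1e-3, meas_std=0.05):
    z = z.detach().requires_grad_(True)
    # Gradient of the measurement log-likelihood through the decoder,
    # with no linear-manifold assumption on the latent space.
    residual = forward_op(decoder(z)) - y
    log_lik = -(residual ** 2).sum() / (2 * meas_std ** 2)
    grad_lik = torch.autograd.grad(log_lik, z)[0]
    drift = score_fn(z, noise_level) + grad_lik  # prior score + data term
    return (z + step * drift + (2 * step) ** 0.5 * torch.randn_like(z)).detach()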
https://arxiv.org/abs/2601.04791
Music Source Restoration (MSR) aims to recover original, unprocessed instrument stems from professionally mixed and degraded audio, requiring the reversal of both production effects and real-world degradations. We present the inaugural MSR Challenge, which features objective evaluation on studio-produced mixtures using Multi-Mel-SNR, Zimtohrli, and FAD-CLAP, alongside subjective evaluation on real-world degraded recordings. Five teams participated in the challenge. The winning system achieved 4.46 dB Multi-Mel-SNR and 3.47 MOS-Overall, corresponding to relative improvements of 91% and 18% over the second-place system, respectively. Per-stem analysis reveals substantial variation in restoration difficulty across instruments, with bass averaging 4.59 dB across all teams, while percussion averages only 0.29 dB. The dataset, evaluation protocols, and baselines are available at this https URL.
https://arxiv.org/abs/2601.04343
Image restoration has traditionally required training specialized models on thousands of paired examples per degradation type. We challenge this paradigm by demonstrating that powerful pre-trained text-conditioned image editing models can be efficiently adapted for multiple restoration tasks through parameter-efficient fine-tuning with remarkably few examples. Our approach fine-tunes LoRA adapters on FLUX.1 Kontext, a state-of-the-art 12B parameter flow matching model for image-to-image translation, using only 16-128 paired images per task, guided by simple text prompts that specify the restoration operation. Unlike existing methods that train specialized restoration networks from scratch with thousands of samples, we leverage the rich visual priors already encoded in large-scale pre-trained editing models, dramatically reducing data requirements while maintaining high perceptual quality. A single unified LoRA adapter, conditioned on task-specific text prompts, effectively handles multiple degradations including denoising, deraining, and dehazing. Through comprehensive ablation studies, we analyze: (i) the impact of training set size on restoration quality, (ii) trade-offs between task-specific versus unified multi-task adapters, (iii) the role of text encoder fine-tuning, and (iv) zero-shot baseline performance. While our method prioritizes perceptual quality over pixel-perfect reconstruction metrics like PSNR/SSIM, our results demonstrate that pre-trained image editing models, when properly adapted, offer a compelling and data-efficient alternative to traditional image restoration approaches, opening new avenues for few-shot, prompt-guided image enhancement. The code to reproduce our results is available at: this https URL
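A sketch of this kind of PEFT setup using Hugging Face peft; the rank, alpha, and target module names are assumptions that depend on the FLUX.1 Kontext implementation being loaded:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections of the editing backbone; the exact module names
    # vary with the model implementation and are illustrative here.
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
# A training pair would then look like:
#   {"prompt": "remove the rain from this photo",
#    "source": degraded_image, "target": clean_image}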
https://arxiv.org/abs/2601.03391
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
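One self-play round might look like the following sketch; every umm.* method is a hypothetical placeholder for the single shared model acting in each role, and the keep threshold is invented for illustration:

def self_play_round(umm, keep_threshold=0.8):
    # All umm.* methods are hypothetical placeholders, not a real API.
    instruction = umm.propose("Write a challenging image-generation task.")  # Proposer
    image = umm.solve(instruction)                                           # Solver
    caption = umm.describe(image)      # closes a Text-to-Image-to-Text loop
    score = umm.judge(instruction, image, caption)                           # Judge
    # Only high-quality interactions are kept for self-distillation.
    return (instruction, image, caption) if score >= keep_threshold else None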
https://arxiv.org/abs/2601.03193
Motion blur caused by camera shake produces ghosting artifacts that substantially degrade edge-side object detection. Existing approaches either suppress blur as noise and lose discriminative structure, or apply full-image restoration that increases latency and limits deployment on resource-constrained devices. We propose DFRCP, a Dynamic Fuzzy Robust Convolutional Pyramid, as a plug-in upgrade to YOLOv11 for blur-robust detection. DFRCP enhances the YOLOv11 feature pyramid by combining large-scale and medium-scale features while preserving native representations, and by introducing Dynamic Robust Switch units that adaptively inject fuzzy features to strengthen global perception under jitter. Fuzzy features are synthesized by rotating and nonlinearly interpolating multiscale features, then merged through a transparency convolution that learns a content-adaptive trade-off between original and fuzzy cues. We further develop a CUDA parallel rotation and interpolation kernel that avoids boundary overflow and delivers a more than 400-fold speedup, making the design practical for edge deployment. We train with paired supervision on a private wheat pest damage dataset of about 3,500 images, augmented threefold using two blur regimes: uniform image-wide motion blur and bounding-box-confined rotational blur. On blurred test sets, YOLOv11 with DFRCP achieves about 10.4 percent higher accuracy than the YOLOv11 baseline with only a modest training-time overhead, reducing the need for manual filtering after data collection.
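A sketch of the fuzzy-feature synthesis as we read it: rotate a feature map, blend it nonlinearly with the original, and let a learned 1x1 "transparency" convolution arbitrate. The angle and blend curve are illustrative choices, and the paper's custom CUDA kernel replaces the grid_sample call used here:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyFeatureMix(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 "transparency" convolution: learns a content-adaptive
        # trade-off between original and fuzzy cues.
        self.transparency = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat, angle_deg=15.0):  # feat: (B, C, H, W)
        b = feat.size(0)
        c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
        rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0]], device=feat.device)
        grid = F.affine_grid(rot.unsqueeze(0).expand(b, -1, -1),
                             list(feat.shape), align_corners=False)
        rotated = F.grid_sample(feat, grid, align_corners=False)
        # Nonlinear (geometric-mean style) interpolation of the two views.
        fuzzy = torch.sqrt(feat.clamp(min=0) * rotated.clamp(min=0) + 1e-6)
        return self.transparency(torch.cat([feat, fuzzy], dim=1))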
https://arxiv.org/abs/2601.03046
Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or provide low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA, enabling joint generation of quality descriptions and scores. However, we notice that existing VLM-based IQA methods tend to exhibit unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions; and 2) reinforcement learning (RL) for dynamic policy exploration, primarily stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, and supported by a Progressive Re-sampling Strategy to mitigate annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
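For concreteness, a GR-IQA-style training record might look like the sketch below; all field names are hypothetical, inferred from the description of assessments grounded in key regions:

gr_iqa_record = {
    "image": "example.png",
    "key_regions": [[40, 32, 180, 150]],  # (x1, y1, x2, y2) boxes grounding the rationale
    "rationale": "Strong motion blur over the foreground face dominates "
                 "the perceived quality loss.",
    "score": 2.8,                         # quality score aligned with human ratings
}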
https://arxiv.org/abs/2601.02918
All-in-One Image Restoration (AiOIR) has advanced significantly, offering promising solutions for complex real-world degradations. However, most existing approaches rely heavily on degradation-specific representations, often resulting in oversmoothing and artifacts. To address this, we propose ClearAIR, a novel AiOIR framework inspired by Human Visual Perception (HVP) and designed with a hierarchical, coarse-to-fine restoration strategy. First, leveraging the global priority of early HVP, we employ a Multimodal Large Language Model (MLLM)-based Image Quality Assessment (IQA) model for overall evaluation. Unlike conventional IQA, our method integrates cross-modal understanding to more accurately characterize complex, composite degradations. Building upon this overall assessment, we then introduce a region awareness and task recognition pipeline. A semantic cross-attention module, leveraging a semantic guidance unit, first produces coarse semantic prompts. Guided by this regional context, a degradation-aware module implicitly captures region-specific degradation characteristics, enabling more precise local restoration. Finally, to recover fine details, we propose an internal clue reuse mechanism. It operates in a self-supervised manner to mine and leverage the intrinsic information of the image itself, substantially enhancing detail restoration. Experimental results show that ClearAIR achieves superior performance across diverse synthetic and real-world datasets.
https://arxiv.org/abs/2601.02763
Solving inverse problems in imaging requires models that support efficient inference, uncertainty quantification, and principled probabilistic reasoning. Energy-Based Models (EBMs), with their interpretable energy landscapes and compositional structure, are well-suited for this task but have historically suffered from high computational costs and training instability. To overcome these shortcomings, we introduce a fast distillation strategy that transfers the strengths of pre-trained diffusion models into multi-scale EBMs. These distilled EBMs enable efficient sampling and preserve the interpretability and compositionality inherent to potential-based frameworks. Leveraging EBM compositionality, we propose the Annealed Langevin Posterior Sampling (ALPS) algorithm for Maximum-A-Posteriori (MAP), Minimum Mean Square Error (MMSE), and uncertainty estimates for inverse problems in imaging. Unlike diffusion models that use complex guidance strategies for latent variables, we perform annealing on static posterior distributions that are well-defined and composable. Experiments on image inpainting and MRI reconstruction demonstrate that our method matches or surpasses diffusion-based baselines in both accuracy and efficiency, while also supporting MAP recovery. Overall, our framework offers a scalable and principled solution for inverse problems in imaging, with potential for practical deployment in scientific and clinical settings. ALPS code is available at this https URL.
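The sampler can be summarized as annealed Langevin dynamics on a static, composable posterior energy; the schedule and step sizes below are assumptions (dropping the noise term gives MAP-style descent, while averaging samples approximates the MMSE estimate):

import torch

def alps_sample(x, y, forward_op, energy, sigmas, steps=50, step=1e-4):
    for sigma in sigmas:  # anneal the noise scale from coarse to fine
        for _ in range(steps):
            x = x.detach().requires_grad_(True)
            # Static posterior energy: distilled multi-scale EBM prior
            # composed with a Gaussian measurement term.
            e = (energy(x, sigma).sum()
                 + ((forward_op(x) - y) ** 2).sum() / (2 * sigma ** 2))
            grad = torch.autograd.grad(e, x)[0]
            x = x - step * grad + (2 * step) ** 0.5 * torch.randn_like(x)
    return x.detach()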
https://arxiv.org/abs/2601.02594
Enhancing the visibility of nighttime hazy images is challenging due to complex degradation distributions. Existing methods mainly address a single type of degradation (e.g., haze or low light) at a time, ignoring the interplay between degradation types and yielding limited visibility improvement. We observe that the domain knowledge shared between low-light and haze priors can be mutually reinforced for better visibility. Based on this key insight, in this paper we propose a novel framework that enhances visibility in nighttime hazy images by mutually and progressively reinforcing the intrinsic consistency between haze and low-light priors. In particular, our model utilizes image-, patch-, and pixel-level experts that operate across the visual and frequency domains to progressively recover global scene structure, regional patterns, and fine-grained details. A frequency-aware router is further introduced to adaptively guide the contribution of each expert, ensuring robust image restoration. Extensive experiments demonstrate the superior performance of our model on nighttime dehazing benchmarks, both quantitatively and qualitatively. Moreover, we showcase the generalizability of our model on daytime dehazing and low-light enhancement tasks.
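A hedged sketch of what a frequency-aware router could look like: spectral energy statistics gate the three experts. The FFT-based features, the frequency cutoff, and the gating form are our assumptions:

import torch
import torch.nn as nn

class FrequencyAwareRouter(nn.Module):
    def __init__(self, num_experts=3):
        super().__init__()
        self.gate = nn.Linear(2, num_experts)

    def forward(self, x, experts):  # x: (B, C, H, W); experts: list of modules
        mag = torch.fft.rfft2(x).abs()
        cut = mag.shape[-1] // 4
        low = mag[..., :cut].mean(dim=(1, 2, 3))    # crude low-frequency energy
        high = mag[..., cut:].mean(dim=(1, 2, 3))   # crude high-frequency energy
        w = self.gate(torch.stack([low, high], dim=-1)).softmax(dim=-1)  # (B, E)
        outs = torch.stack([expert(x) for expert in experts], dim=1)     # (B, E, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)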
https://arxiv.org/abs/2601.01998
Synthetic aperture radar (SAR) provides valuable information about the Earth's surface under all weather and illumination conditions. However, the inherent phenomenon of speckle and the presence of sidelobes around bright targets pose challenges for accurate interpretation of SAR imagery. Most existing SAR image restoration methods address despeckling and sidelobes reduction as separate tasks. In this paper, we propose a unified framework that jointly performs both tasks using neural networks (NNs) trained on a realistic SAR simulated dataset generated with MOCEM. Inference can then be performed on real SAR images, demonstrating effective simulation to real (Sim2Real) transferability. Additionally, we incorporate acquisition metadata as auxiliary input to the NNs, demonstrating improved restoration performance.
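Metadata conditioning of this kind is often implemented by broadcasting per-acquisition scalars to constant channels; a sketch under that assumption (the specific metadata fields are illustrative):

import torch

def concat_metadata(sar_image, metadata):
    # sar_image: (B, C, H, W); metadata: (B, K) per-acquisition scalars,
    # e.g. incidence angle or resolution (the fields used are assumptions).
    b, _, h, w = sar_image.shape
    meta_maps = metadata[:, :, None, None].expand(b, metadata.size(1), h, w)
    return torch.cat([sar_image, meta_maps], dim=1)  # (B, C + K, H, W)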
https://arxiv.org/abs/2601.01541
This research aims to develop a novel deep learning network, GBU-Net, utilizing a group-batch-normalized U-Net framework specifically designed for precise semantic segmentation of the left ventricle in short-axis cine MRI scans. The methodology includes a down-sampling pathway for feature extraction and an up-sampling pathway for detail restoration, enhanced for medical imaging. Key modifications include techniques for the better contextual understanding crucial in cardiac MRI segmentation. The dataset consists of 805 left-ventricular MRI scans from 45 patients, with comparative analysis using established metrics such as the Dice coefficient and mean perpendicular distance. GBU-Net significantly improves the accuracy of left ventricle segmentation in cine MRI scans, outperforming existing methods on both metrics. The approach is distinctive in its ability to capture contextual information often missed by traditional CNN-based segmentation. An ensemble of GBU-Net models attains a 97% Dice score on the SunnyBrook testing dataset. GBU-Net offers enhanced precision and contextual understanding in left ventricle segmentation for surgical robotics and medical analysis.
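The Dice coefficient referenced above is the standard overlap score between a predicted and a ground-truth mask; for reference:

import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, eps=1e-7) -> float:
    # Dice = 2|P ∩ T| / (|P| + |T|) over binary masks.
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return float(2.0 * inter / (pred.sum() + truth.sum() + eps))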
https://arxiv.org/abs/2601.01512
Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at this https URL.
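The landmark heatmaps are standard dense Gaussians centered on keypoints; a sketch (sigma and the stacking convention are assumptions):

import numpy as np

def landmark_heatmaps(landmarks, height, width, sigma=3.0):
    # landmarks: iterable of (x, y) pixel coordinates of facial keypoints.
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - lx) ** 2 + (ys - ly) ** 2) / (2.0 * sigma ** 2))
            for (lx, ly) in landmarks]
    # Stacked as extra channels alongside the low-resolution input.
    return np.stack(maps, axis=0)  # (num_landmarks, H, W)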
https://arxiv.org/abs/2601.01406
All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model, but extending this task to videos poses unique challenges. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. In practice, degradation types and intensities evolve smoothly over time, and multiple degradations may coexist or transition gradually. In this paper, we introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and the degradation intensity change continuously over time. To support this scenario, we design a flexible synthesis pipeline that generates temporally coherent videos with single, compound, and evolving degradations. To address the challenges of the SEUD scenario, we propose an all-in-One Recurrent Conditional and Adaptive prompting Network (ORCANet). First, a Coarse Intensity Estimation Dehazing (CIED) module estimates haze intensity using physical priors and provides coarse dehazed features as initialization. Second, a Flow Prompt Generation (FPG) module extracts degradation features. FPG generates both static prompts that capture segment-level degradation types and dynamic prompts that adapt to frame-level intensity variations. Furthermore, a label-aware supervision mechanism improves the discriminability of static prompt representations under different degradations. Extensive experiments show that ORCANet achieves superior restoration quality, temporal consistency, and robustness over image- and video-based baselines. Code is available at this https URL.
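A sketch of the static/dynamic prompt split as we read it; the dimensions, the prompt bank, and the pooling choice are all assumptions:

import torch
import torch.nn as nn

class FlowPromptSketch(nn.Module):
    def __init__(self, dim=64, num_types=6):
        super().__init__()
        self.static_bank = nn.Parameter(torch.randn(num_types, dim) * 0.02)
        self.type_head = nn.Linear(dim, num_types)     # segment-level degradation type
        self.intensity_head = nn.Linear(dim, dim)      # frame-level intensity response

    def forward(self, frame_feats):  # (B, T, dim)
        seg = frame_feats.mean(dim=1)                  # segment-level summary
        type_w = self.type_head(seg).softmax(dim=-1)   # (B, num_types)
        static_prompt = type_w @ self.static_bank      # (B, dim): degradation type
        dynamic_prompt = self.intensity_head(frame_feats)  # (B, T, dim): per-frame intensity
        return static_prompt, dynamic_prompt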
https://arxiv.org/abs/2601.00533
We present a lightweight two-stage framework for joint geometry and color inpainting of damaged 3D objects, motivated by the digital restoration of cultural heritage artifacts. The pipeline separates damage localization from reconstruction. In the first stage, a 2D convolutional network predicts damage masks on RGB slices extracted from a voxelized object, and these predictions are aggregated into a volumetric mask. In the second stage, a diffusion-based 3D U-Net performs mask-conditioned inpainting directly on voxel grids, reconstructing geometry and color while preserving observed regions. The model jointly predicts occupancy and color using a composite objective that combines occupancy reconstruction with masked color reconstruction and perceptual regularization. We evaluate the approach on a curated set of textured artifacts with synthetically generated damage using standard geometric and color metrics. Compared to symmetry-based baselines, our method produces more complete geometry and more coherent color reconstructions at a fixed 32^3 resolution. Overall, the results indicate that explicit mask conditioning is a practical way to guide volumetric diffusion models for joint 3D geometry and color inpainting.
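A sketch of the composite objective, with the perceptual regularization term omitted; the weights and the masked-L1 color term are assumptions consistent with the description:

import torch
import torch.nn.functional as F

def composite_inpaint_loss(occ_logits, occ_true, rgb_pred, rgb_true, mask,
                           w_occ=1.0, w_rgb=1.0):
    # Occupancy reconstruction over the full voxel grid.
    occ_loss = F.binary_cross_entropy_with_logits(occ_logits, occ_true)
    # Color reconstruction restricted to the predicted damage mask,
    # leaving observed regions untouched.
    rgb_loss = (mask * (rgb_pred - rgb_true).abs()).sum() / mask.sum().clamp(min=1.0)
    return w_occ * occ_loss + w_rgb * rgb_loss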
https://arxiv.org/abs/2601.00368