Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured, grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. The inward pathway complements it by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward pathway selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test time, without any parameter modification. Extensive experiments on five benchmarks verify that KGA achieves knowledge-fusion performance comparable to existing approaches.
https://arxiv.org/abs/2507.08704
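To make the two pathways concrete, here is a minimal, training-free sketch of how KG-guided attention might be wired, assuming triples are already embedded as vectors; the relevance scoring, top-k selection, and gating below are our illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def kga_attention(x, triple_emb, top_k=8):
    """x: (seq, d) token states; triple_emb: (n_triples, d) KG triple embeddings."""
    # Inward pathway: score triples against the input and keep the most relevant ones.
    scores = x.mean(dim=0) @ triple_emb.T                   # (n_triples,)
    top = scores.topk(min(top_k, triple_emb.size(0))).indices
    kg = triple_emb[top]                                    # (k, d) selected triples

    # Outward pathway: fuse selected knowledge into token states via cross-attention.
    attn = F.softmax(x @ kg.T / x.size(-1) ** 0.5, dim=-1)  # (seq, k)
    fused = x + attn @ kg                                   # residual knowledge injection

    # Inward filtering: gate tokens by their relevance to the selected triples,
    # suppressing knowledge-irrelevant signals. No parameters are updated anywhere.
    gate = torch.sigmoid((fused @ kg.T).max(dim=-1).values).unsqueeze(-1)
    return gate * fused + (1 - gate) * x
```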
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
https://arxiv.org/abs/2507.08380
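A toy sketch of the two consistency terms, under our reading of the abstract: reflectance is approximated with a simple max-channel Retinex normalization, and caption consistency compares embeddings from a frozen captioner (not shown). Both forms are assumptions, not the paper's losses.

```python
import torch
import torch.nn.functional as F

def reflectance(img, eps=1e-6):
    # Retinex-style reflectance: image divided by a crude illumination map
    # (max over channels here; a deliberate simplification).
    illum = img.max(dim=1, keepdim=True).values.clamp(min=eps)
    return img / illum

def scuf_consistency_losses(low, enhanced, caption_emb_low, caption_emb_enh):
    """low, enhanced: (B, 3, H, W) images; caption_emb_*: (B, d) embeddings
    of captions produced by a frozen captioner on each image."""
    # Reflectance consistency: image-level spatial semantics should survive enhancement.
    refl_loss = F.l1_loss(reflectance(enhanced), reflectance(low))
    # Caption consistency: high-level semantics should stay aligned.
    cap_loss = 1 - F.cosine_similarity(caption_emb_low, caption_emb_enh, dim=-1).mean()
    return refl_loss + cap_loss
```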
Video restoration and enhancement are critical not only for improving visual quality, but also as essential pre-processing steps to boost the performance of a wide range of downstream computer vision tasks. This survey presents a comprehensive review of video restoration and enhancement techniques with a particular focus on unsupervised approaches. We begin by outlining the most common video degradations and their underlying causes, followed by a review of early conventional and deep learning-based methods, highlighting their strengths and limitations. We then present an in-depth overview of unsupervised methods, categorised by their fundamental approaches, including domain translation, self-supervision signal design, and blind-spot or noise-based methods. We also provide a categorization of loss functions employed in unsupervised video restoration and enhancement, and discuss the role of paired synthetic datasets in enabling objective evaluation. Finally, we identify key challenges and outline promising directions for future research in this field.
https://arxiv.org/abs/2507.08375
Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
https://arxiv.org/abs/2507.08135
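A purely illustrative skeleton of the four-stage pipeline; the actual DARAS blocks (the deep audio encoder, Mamba-based MASS-BRPE, hybrid-path cross-attention, and the adaptive DAT segmentation) are stubbed here with generic layers, and the early/late split is fixed rather than learned.

```python
import torch
import torch.nn as nn

class DARASSkeleton(nn.Module):
    # Hypothetical sizes and heads; only the data flow mirrors the abstract.
    def __init__(self, d=256, rir_len=16000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv1d(1, d, 9, 4, 4), nn.GELU(),
                                     nn.Conv1d(d, d, 9, 4, 4))
        self.param_head = nn.Linear(d, 4)       # e.g. RT60, C50, volume, absorption
        self.param_proj = nn.Linear(4, d)
        self.cross_attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.decoder = nn.Linear(d, rir_len)

    def forward(self, speech):                  # speech: (B, 1, T) reverberant input
        feats = self.encoder(speech).transpose(1, 2)   # (B, T', d) latent features
        params = self.param_head(feats.mean(1))        # blind room-parameter estimate
        kv = self.param_proj(params).unsqueeze(1)      # room features as keys/values
        fused, _ = self.cross_attn(feats, kv, kv)      # audio <-> room feature fusion
        rir = self.decoder(fused.mean(1))              # (B, rir_len) synthesized RIR
        early, late = rir[:, :800], rir[:, 800:]       # fixed boundary; DAT adapts this
        return params, early, late
```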
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at this https URL.
https://arxiv.org/abs/2507.07939
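The DPO core below is standard; the entropy weighting is our guess at how "entropy-aware" preference optimization might modulate each pair, so treat it as a hedged sketch rather than the paper's E-DPO.

```python
import torch
import torch.nn.functional as F

def entropy_aware_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, entropy, beta=0.1):
    """logp_*: summed log-probs of chosen (w) / rejected (l) responses under the
    policy; ref_logp_*: the same under a frozen reference model; entropy: (B,)
    per-example predictive entropy (the modulation below is an assumption)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    weight = torch.exp(-entropy)          # down-weight high-entropy (uncertain) pairs
    return -(weight * F.logsigmoid(margin)).mean()
```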
Robust Visual SLAM (vSLAM) is essential for autonomous systems operating in real-world environments, where challenges such as dynamic objects, low texture, and, critically, varying illumination conditions often degrade performance. Existing feature-based SLAM systems rely on fixed front-end parameters, making them vulnerable to sudden lighting changes and unstable feature tracking. To address these challenges, we propose "IRAF-SLAM", an Illumination-Robust and Adaptive Feature-Culling front-end designed to enhance vSLAM resilience in complex and challenging environments. Our approach introduces: (1) an image enhancement scheme to preprocess and adjust image quality under varying lighting conditions; (2) an adaptive feature extraction mechanism that dynamically adjusts detection sensitivity based on image entropy, pixel intensity, and gradient analysis; and (3) a feature culling strategy that filters out unreliable feature points using density distribution analysis and a lighting impact factor. Comprehensive evaluations on the TUM-VI and European Robotics Challenge (EuRoC) datasets demonstrate that IRAF-SLAM significantly reduces tracking failures and achieves superior trajectory accuracy compared to state-of-the-art vSLAM methods under adverse illumination conditions. These results highlight the effectiveness of adaptive front-end strategies in improving vSLAM robustness without incurring significant computational overhead. The implementation of IRAF-SLAM is publicly available at https://thanhnguyencanh.github.io/iraf-slam/.
https://arxiv.org/abs/2507.07752
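A small illustration of entropy/intensity/gradient-driven detection sensitivity (point 2 above); the abstract does not give the exact formula, so the scaling rule here is invented for demonstration.

```python
import cv2
import numpy as np

def adaptive_fast_threshold(gray, base_thr=20):
    """Return a FAST threshold adapted to image statistics (illustrative rule)."""
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))        # image entropy
    grad = np.mean(np.abs(cv2.Sobel(gray, cv2.CV_32F, 1, 0)) +
                   np.abs(cv2.Sobel(gray, cv2.CV_32F, 0, 1)))
    # Low entropy / weak gradients (flat or dark scenes) -> lower the threshold
    # so enough features survive; clipping keeps the adjustment bounded.
    scale = np.clip((entropy / 7.0) * (grad / 30.0), 0.3, 1.5)
    return int(base_thr * scale)

# Example usage on a single frame:
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
fast = cv2.FastFeatureDetector_create(adaptive_fast_threshold(gray))
keypoints = fast.detect(gray, None)
```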
Matching theoretical predictions to experimental data remains a central challenge in hadron spectroscopy. In particular, the identification of new hadronic states is difficult, as exotic signals near threshold can arise from a variety of physical mechanisms. A key diagnostic in this context is the pole structure of the scattering amplitude, but different configurations can produce similar signatures. The mapping between pole configurations and line shapes is especially ambiguous near the mass threshold, where analytic control is limited. In this work, we introduce an uncertainty-aware machine learning approach for classifying pole structures in $S$-matrix elements. Our method is based on an ensemble of classifier chains that provide both epistemic and aleatoric uncertainty estimates. We apply a rejection criterion based on predictive uncertainty, achieving a validation accuracy of nearly $95\%$ while discarding only a small fraction of high-uncertainty predictions. Trained on synthetic data with known pole structures, the model generalizes to previously unseen experimental data, including enhancements associated with the $P_{c\bar{c}}(4312)^+$ state observed by LHCb. For this state, we infer a four-pole structure, indicating a genuine compact pentaquark coexisting with a higher-channel virtual-state pole of non-vanishing width. While evaluated on this particular state, our framework is broadly applicable to other candidate hadronic states and offers a scalable tool for pole structure inference in scattering amplitudes.
https://arxiv.org/abs/2507.07668
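The uncertainty decomposition and rejection rule can be sketched directly: with an ensemble's per-member class probabilities, the usual mutual-information split separates epistemic from aleatoric uncertainty, and predictions above a threshold are abstained on. The threshold value is illustrative.

```python
import numpy as np

def predict_with_rejection(member_probs, tau=0.5):
    """member_probs: (n_members, n_classes) softmax outputs of the ensemble
    for one input. Returns (class or None, (epistemic, aleatoric))."""
    mean_p = member_probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + 1e-12))        # predictive entropy
    aleatoric = -np.mean(np.sum(member_probs * np.log(member_probs + 1e-12), axis=1))
    epistemic = total - aleatoric                           # mutual information
    if total > tau:
        return None, (epistemic, aleatoric)                 # abstain: too uncertain
    return int(mean_p.argmax()), (epistemic, aleatoric)
```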
Single-channel speech enhancement is utilized in various tasks to mitigate the effect of interfering signals. Conventionally, the speech enhancement model has had to be tuned for each task to ensure optimal performance. Thus, generalizing speech enhancement models to unknown downstream tasks has been challenging. This study aims to construct a generic speech enhancement front-end that can improve the performance of back-ends on multiple downstream tasks. To this end, we propose a novel training criterion that minimizes the distance between the enhanced signal and the ground-truth clean signal in the feature representation domain of self-supervised learning models. Since self-supervised feature representations effectively express high-level speech information useful for solving various downstream tasks, this criterion is expected to make speech enhancement models preserve such information. Experimental validation demonstrates that the proposed method improves the performance of multiple speech tasks while maintaining the perceptual quality of the enhanced signal.
https://arxiv.org/abs/2507.07631
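The training criterion is easy to sketch with a public SSL checkpoint: compare layer-wise wav2vec2 features of the enhanced and clean signals while keeping the SSL model frozen. The layer set and the L1 distance are our assumptions.

```python
import torch
import torchaudio

# Frozen SSL feature extractor from torchaudio's public wav2vec2 bundle.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()
for p in ssl_model.parameters():
    p.requires_grad_(False)                      # SSL model is never updated

def ssl_feature_loss(enhanced, clean):
    """enhanced, clean: (B, T) waveforms at bundle.sample_rate (16 kHz).
    Gradients flow through `enhanced` into the enhancement model."""
    f_enh, _ = ssl_model.extract_features(enhanced)   # list of per-layer features
    with torch.no_grad():
        f_cln, _ = ssl_model.extract_features(clean)
    return sum(torch.nn.functional.l1_loss(a, b)
               for a, b in zip(f_enh, f_cln)) / len(f_enh)
```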
This paper presents enhancements to the SAM2 framework for the video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our training-free, low-overhead improvements for boosting long-term tracking performance. The code is available at this https URL.
https://arxiv.org/abs/2507.07603
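A toy rendition of the two training-free ideas: constant-velocity linear prediction with refinement only on disagreement, and a memory bank split into short-term (recent appearance) and long-term (high-confidence) frames. Capacities and thresholds are invented for illustration.

```python
from collections import deque
import numpy as np

class HierarchicalTracker:
    def __init__(self, short_cap=6, long_cap=12):
        self.short = deque(maxlen=short_cap)   # recent frames: tracks appearance drift
        self.long = deque(maxlen=long_cap)     # confident frames: survive occlusion

    def predict_box(self, boxes):
        """Lightweight linear (constant-velocity) prediction from past boxes."""
        if len(boxes) < 2:
            return boxes[-1]
        return boxes[-1] + (boxes[-1] - boxes[-2])

    def needs_refinement(self, iou_with_prediction, thr=0.5):
        # Invoke the heavier non-linear refinement only when the cheap
        # prediction disagrees with the observation.
        return iou_with_prediction < thr

    def update_memory(self, frame_feat, confidence, thr=0.8):
        self.short.append(frame_feat)
        if confidence > thr:                   # only reliable frames become long-term
            self.long.append(frame_feat)

# Example: predict the next box from two past observations.
tracker = HierarchicalTracker()
boxes = [np.array([10., 10., 50., 50.]), np.array([12., 11., 52., 51.])]
next_box = tracker.predict_box(boxes)
```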
Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer limited control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that the refinement step significantly enhances counterspeech quality and is preferred by human evaluators.
https://arxiv.org/abs/2507.07307
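One plausible way to wire the agents, with `retriever` and `llm` as hypothetical stand-in callables rather than a real API; the prompts and the static/dynamic evidence split simply mirror the roles named in the abstract.

```python
def generate_counterspeech(claim, retriever, llm):
    """claim: misinformation text; retriever, llm: hypothetical callables."""
    static_docs = retriever(claim, source="curated")   # static evidence base
    fresh_docs = retriever(claim, source="web")        # dynamic, up-to-date evidence
    # Evidence-enhancement agent: rank and condense the retrieved material.
    evidence = llm("Rank and summarize evidence refuting: " + claim
                   + "\n" + "\n".join(static_docs + fresh_docs))
    # Generation agent: draft grounded counterspeech.
    draft = llm("Write a polite, well-grounded counterspeech to: " + claim
                + "\nEvidence: " + evidence)
    # Refinement agent: the step human evaluators preferred.
    return llm("Refine this draft for politeness and factual accuracy:\n" + draft)
```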
Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantics-agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. Finally, the visual features, each corresponding to an individual medical concept, are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.
https://arxiv.org/abs/2507.06992
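The feature-gating mechanism lends itself to a short sketch: score each concept-aligned visual feature and damp the low-quality ones. The MLP scorer below is a guessed form, not the paper's module.

```python
import torch
import torch.nn as nn

class ConceptGate(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d, d // 4), nn.ReLU(),
                                    nn.Linear(d // 4, 1))

    def forward(self, concept_feats):           # (B, n_concepts, d)
        gate = torch.sigmoid(self.scorer(concept_feats))   # (B, n_concepts, 1)
        return gate * concept_feats             # low-scoring concept features are damped
```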
Generalized Category Discovery (GCD) aims to recognize unlabeled images from known and novel classes by distinguishing novel classes from known ones, while also transferring knowledge from another set of labeled images with known classes. Existing GCD methods rely on self-supervised vision transformers such as DINO for representation learning. However, focusing solely on the global representation of the DINO CLS token introduces an inherent trade-off between discriminability and generalization. In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. More importantly, we propose a novel all-min contrastive loss to learn discriminative yet generalizable part representation, which adaptively highlights discriminative object parts to distinguish similar categories for enhanced discriminability while simultaneously sharing other parts to facilitate knowledge transfer for improved generalization. Our APL can easily be incorporated into different GCD frameworks by replacing their CLS token feature with our part representations, showing significant enhancements on fine-grained datasets.
https://arxiv.org/abs/2507.06928
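The shared learnable part queries can be sketched as cross-attention from a small set of query vectors to frozen DINO patch tokens; the part count, width, and head count below are illustrative, and the all-min contrastive loss is not reproduced here.

```python
import torch
import torch.nn as nn

class PartQueries(nn.Module):
    """Shared queries attend to patch tokens to produce per-part representations."""
    def __init__(self, n_parts=8, d=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_parts, d) * 0.02)
        self.attn = nn.MultiheadAttention(d, 8, batch_first=True)

    def forward(self, patch_tokens):            # (B, N, d) frozen DINO patch features
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        parts, attn_map = self.attn(q, patch_tokens, patch_tokens)
        return parts, attn_map                  # (B, n_parts, d), soft part assignments
```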
Low-Light Image Enhancement (LLIE) aims to restore vivid content and details from corrupted low-light images. However, existing standard RGB (sRGB) color space-based LLIE methods often produce color bias and brightness artifacts due to the inherent high color sensitivity. While Hue, Saturation, and Value (HSV) color space can decouple brightness and color, it introduces significant red and black noise artifacts. To address this problem, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by the HV color map and learnable intensity. The HV color map enforces small distances for the red coordinates to remove red noise artifacts, while the learnable intensity compresses the low-light regions to remove black noise artifacts. Additionally, we introduce the Color and Intensity Decoupling Network+ (HVI-CIDNet+), built upon the HVI color space, to restore damaged content and mitigate color distortion in extremely dark regions. Specifically, HVI-CIDNet+ leverages abundant contextual and degraded knowledge extracted from low-light images using pre-trained vision-language models, integrated via a novel Prior-guided Attention Block (PAB). Within the PAB, latent semantic priors can promote content restoration, while degraded representations guide precise color correction, both particularly in extremely dark regions through the meticulously designed cross-attention fusion mechanism. Furthermore, we construct a Region Refinement Block that employs convolution for information-rich regions and self-attention for information-scarce regions, ensuring accurate brightness adjustments. Comprehensive results from benchmark experiments demonstrate that the proposed HVI-CIDNet+ outperforms the state-of-the-art methods on 10 datasets.
https://arxiv.org/abs/2507.06814
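A rough, speculative sketch of the HVI idea as we read it: hue and saturation mapped onto a polar (H, V) plane whose radius is compressed in dark regions by a learnable exponent. This is only our interpretation of the abstract, not the paper's transform.

```python
import torch

def rgb_to_hvi(rgb, k):
    """rgb: (B, 3, H, W) in [0, 1]; k: learnable scalar (intensity exponent)."""
    i_max, _ = rgb.max(dim=1, keepdim=True)            # value/intensity channel
    i_min, _ = rgb.min(dim=1, keepdim=True)
    sat = (i_max - i_min) / (i_max + 1e-6)
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    hue = torch.atan2(3.0 ** 0.5 * (g - b), 2 * r - g - b)   # circular hue angle
    # Radius shrinks in low-light regions (learnable compression), so color-noise
    # coordinates collapse where the signal is unreliable.
    radius = sat * i_max.clamp(min=1e-6) ** k
    h = radius * torch.cos(hue)                        # horizontal coordinate
    v = radius * torch.sin(hue)                        # vertical coordinate
    return torch.cat([h, v, i_max], dim=1)
```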
Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. Evaluating these datasets on the object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: this https URL.
https://arxiv.org/abs/2507.06797
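A minimal paste-based version of the compositing step, covering the position, scale, and orientation controls named above; the viewpoint alignment with the background is omitted, so treat this as a simplified sketch.

```python
import cv2

def composite_thermal(background, obj, mask, top_left, scale=1.0, angle_deg=0.0):
    """background: (H, W) float32 thermal image; obj, mask: (h, w) object crop
    and its binary mask; top_left: (y, x) paste position."""
    if scale != 1.0:
        obj = cv2.resize(obj, None, fx=scale, fy=scale)
        mask = cv2.resize(mask, None, fx=scale, fy=scale,
                          interpolation=cv2.INTER_NEAREST)
    if angle_deg:
        h, w = obj.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
        obj = cv2.warpAffine(obj, M, (w, h))
        mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    y, x = top_left
    h, w = obj.shape
    roi = background[y:y + h, x:x + w]
    roi[mask > 0] = obj[mask > 0]        # overwrite background pixels under the mask
    return background
```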
Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twin applications. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics. The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.
https://arxiv.org/abs/2507.06738
Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This, however, is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model's quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model's strong generalizability for zero-shot class-agnostic counting. Code is available at this https URL
https://arxiv.org/abs/2507.06679
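A hedged sketch of what a cross-stream quantity ranking loss could look like: for every batch pair whose ground-truth counts differ, both streams' predicted counts should preserve that order. The pairwise margin form is an assumption.

```python
import torch

def cross_stream_ranking_loss(pred_t, pred_c, gt, margin=0.0):
    """pred_t, pred_c: (B,) predicted counts from the Transformer / CNN streams;
    gt: (B,) ground-truth counts."""
    i, j = torch.triu_indices(gt.size(0), gt.size(0), offset=1)   # all batch pairs
    sign = torch.sign(gt[i] - gt[j])                 # ground-truth ordering
    loss = 0.0
    for pred in (pred_t, pred_c):
        # Penalize pairs whose predicted order contradicts the ground truth.
        loss = loss + torch.relu(margin - sign * (pred[i] - pred[j])).mean()
    return loss / 2
```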
Offline multi-task reinforcement learning aims to learn a unified policy capable of solving multiple tasks using only pre-collected task-mixed datasets, without requiring any online interaction with the environment. However, it faces significant challenges in effectively sharing knowledge across tasks. Inspired by the efficient knowledge abstraction observed in human learning, we propose Goal-Oriented Skill Abstraction (GO-Skill), a novel approach designed to extract and utilize reusable skills to enhance knowledge transfer and task performance. Our approach uncovers reusable skills through a goal-oriented skill extraction process and leverages vector quantization to construct a discrete skill library. To mitigate class imbalances between broadly applicable and task-specific skills, we introduce a skill enhancement phase to refine the extracted skills. Furthermore, we integrate these skills using hierarchical policy learning, enabling the construction of a high-level policy that dynamically orchestrates discrete skills to accomplish specific tasks. Extensive experiments on diverse robotic manipulation tasks within the MetaWorld benchmark demonstrate the effectiveness and versatility of GO-Skill.
https://arxiv.org/abs/2507.06628
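The discrete skill library rests on standard vector quantization, which is compact enough to show in full (VQ-VAE-style, with a straight-through estimator); the codebook size and use within GO-Skill's extraction phase are left abstract.

```python
import torch
import torch.nn.functional as F

def quantize_skill(z, codebook):
    """z: (B, d) continuous skill embeddings; codebook: (K, d) discrete skill codes."""
    dist = torch.cdist(z, codebook)                  # (B, K) distances to all codes
    idx = dist.argmin(dim=1)                         # nearest skill id per sample
    z_q = codebook[idx]
    commit = F.mse_loss(z, z_q.detach())             # commitment loss term
    z_q = z + (z_q - z).detach()                     # straight-through gradient
    return z_q, idx, commit
```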
Medical Hyperspectral Imaging (MHSI) has emerged as a promising tool for enhanced disease diagnosis, particularly in computational pathology, offering rich spectral information that aids in identifying subtle biochemical properties of tissues. Despite these advantages, effectively fusing both spatial-dimensional and spectral-dimensional information from MHSIs remains challenging due to their inherently high dimensionality and spectral redundancy. To solve the above challenges, we propose a novel spatial-spectral omni-fusion network for hyperspectral image segmentation, named Omni-Fuse. Here, we introduce abundant cross-dimensional feature fusion operations, including a cross-dimensional enhancement module that refines both spatial and spectral features through bidirectional attention mechanisms, a spectral-guided spatial query selection that selects the most spectrally relevant spatial features as queries, and a two-stage cross-dimensional decoder that dynamically guides the model to focus on the selected spatial queries. Despite its numerous attention blocks, Omni-Fuse remains efficient in execution. Experiments on two microscopic hyperspectral image datasets show that our approach significantly improves segmentation performance compared with state-of-the-art methods, with an improvement of over 5.73 percent in DSC. Code available at: this https URL.
https://arxiv.org/abs/2507.06606
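The spectral-guided spatial query selection can be illustrated as picking the spatial tokens most similar to a spectral summary vector; the cosine scoring and top-k choice are our assumptions about its shape.

```python
import torch
import torch.nn.functional as F

def spectral_guided_query_selection(spatial, spectral, k=16):
    """spatial: (N, d) spatial tokens; spectral: (M, d) spectral tokens.
    Returns the k spatial tokens most related to the spectral summary."""
    ref = spectral.mean(dim=0)                       # spectral summary vector
    rel = F.cosine_similarity(spatial, ref.unsqueeze(0), dim=-1)   # (N,) relevance
    idx = rel.topk(min(k, spatial.size(0))).indices
    return spatial[idx], idx                         # selected queries and their ids
```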
Acoustic signals from industrial machines offer valuable insights for anomaly detection, predictive maintenance, and operational efficiency enhancement. However, existing task-specific, supervised learning methods often scale poorly and fail to generalize across diverse industrial scenarios, whose acoustic characteristics are distinct from general audio. Furthermore, the scarcity of accessible, large-scale datasets and pretrained models tailored for industrial audio impedes community-driven research and benchmarking. To address these challenges, we introduce DINOS (Diverse INdustrial Operation Sounds), a large-scale open-access dataset. DINOS comprises 74,149 audio samples (over 1,093 hours) collected from various industrial acoustic scenarios. We also present IMPACT (Industrial Machine Perception via Acoustic Cognitive Transformer), a novel foundation model for industrial machine sound analysis. IMPACT is pretrained on DINOS in a self-supervised manner. By jointly optimizing utterance- and frame-level losses, it captures both global semantics and fine-grained temporal structures. This makes its representations suitable for efficient fine-tuning on various industrial downstream tasks with minimal labeled data. Comprehensive benchmarking across 30 distinct downstream tasks (spanning four machine types) demonstrates that IMPACT outperforms existing models on 24 tasks, establishing its superior effectiveness and robustness, while providing a new performance benchmark for future research.
https://arxiv.org/abs/2507.06481
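A stand-in for the joint objective: one utterance-level term for global semantics plus one frame-level term for temporal structure. The abstract does not specify the losses, so the cosine and MSE terms here are placeholders only.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(utt_emb, utt_target, frame_pred, frame_target, lam=0.5):
    """utt_emb, utt_target: (B, d) utterance embeddings and their targets;
    frame_pred, frame_target: (B, T, d) frame-level predictions and targets."""
    utt_loss = 1 - F.cosine_similarity(utt_emb, utt_target, dim=-1).mean()  # global
    frame_loss = F.mse_loss(frame_pred, frame_target)    # fine-grained structure
    return utt_loss + lam * frame_loss
```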
Automatic speech quality assessment plays a crucial role in the development of speech synthesis systems, but existing models exhibit significant performance variations across different granularity levels of prediction tasks. This paper proposes an enhanced MOS prediction system based on self-supervised learning speech models, incorporating a Mixture of Experts (MoE) classification head and utilizing synthetic data from multiple commercial generation models for data augmentation. Our method builds upon existing self-supervised models such as wav2vec2, designing a specialized MoE architecture to address different types of speech quality assessment tasks. We also collected a large-scale synthetic speech dataset encompassing the latest text-to-speech, speech conversion, and speech enhancement systems. However, despite the adoption of the MoE architecture and expanded dataset, the model's performance improvements in sentence-level prediction tasks remain limited. Our work reveals the limitations of current methods in handling sentence-level quality assessment, provides new technical pathways for the field of automatic speech quality assessment, and also delves into the fundamental causes of performance differences across different assessment granularities.
https://arxiv.org/abs/2507.06116
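A compact sketch of an MoE regression head over pooled SSL features, with softmax routing over experts; the expert count, widths, and dense (non-sparse) routing are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    def __init__(self, d=768, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
            for _ in range(n_experts))

    def forward(self, feats):                        # feats: (B, d) pooled SSL features
        w = torch.softmax(self.gate(feats), dim=-1)  # (B, n_experts) routing weights
        outs = torch.stack([e(feats) for e in self.experts], dim=1)  # (B, E, 1)
        return (w.unsqueeze(-1) * outs).sum(dim=1).squeeze(-1)       # weighted MOS
```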