The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
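The coarse frame-selection step can be sketched in a few lines. The real FFSM is a learned module, so the cosine-similarity ranking, the toy embeddings, and the `select_key_frames` helper below are purely illustrative assumptions, not the published method:

```python
import numpy as np

def select_key_frames(frame_feats, text_feat, k=2):
    """Rank frames by cosine similarity to the text query and keep the
    top-k, discarding temporally redundant frames (toy stand-in for FFSM)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return np.sort(np.argsort(-(f @ t))[:k])  # retained frame indices, in order

# toy frame embeddings: frames 0 and 4 point along the query direction
frames = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.7, 0.7],
                   [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
print(select_key_frames(frames, query, k=2))   # → [0 4]
```

The fine-grained PFCM stage would then operate only on the patches of the retained frames.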
https://arxiv.org/abs/2601.16155
Text-Based Person Search (TBPS) holds unique value in real-world surveillance bridging visual perception and language understanding, yet current paradigms utilizing pre-training models often fail to transfer effectively to complex open-world scenarios. The reliance on "Passive Observation" leads to multifaceted spurious correlations and spatial semantic misalignment, causing a lack of robustness against distribution shifts. To fundamentally resolve these defects, this paper proposes ICON (Invariant Counterfactual Optimization with Neuro-symbolic priors), a framework integrating causal and topological priors. First, we introduce Rule-Guided Spatial Intervention to strictly penalize sensitivity to bounding box noise, forcibly severing location shortcuts to achieve geometric invariance. Second, Counterfactual Context Disentanglement is implemented via semantic-driven background transplantation, compelling the model to ignore background interference for environmental independence. Then, we employ Saliency-Driven Semantic Regularization with adaptive masking to resolve local saliency bias and guarantee holistic completeness. Finally, Neuro-Symbolic Topological Alignment utilizes neuro-symbolic priors to constrain feature matching, ensuring activated regions are topologically consistent with human structural logic. Experimental results demonstrate that ICON not only maintains leading performance on standard benchmarks but also exhibits exceptional robustness against occlusion, background interference, and localization noise. This approach effectively advances the field by shifting from fitting statistical co-occurrences to learning causal invariance.
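The Rule-Guided Spatial Intervention can be approximated as a consistency penalty over jittered crops: features of a person crop should not change when the bounding box is perturbed. The exact rule and loss in the paper are not reproduced here; `jitter_invariance_loss`, the jitter range, and the toy mean-pool embedding are assumptions for illustration:

```python
import numpy as np

def jitter_invariance_loss(embed, image, box, n_jitter=4, max_shift=2, seed=0):
    """Penalize feature sensitivity to bounding-box noise by comparing the
    embedding of the clean crop against embeddings of jittered crops."""
    rng = np.random.default_rng(seed)
    x0, y0, x1, y1 = box
    ref = embed(image[y0:y1, x0:x1])           # clean-crop embedding
    loss = 0.0
    for _ in range(n_jitter):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        crop = image[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        loss += float(np.mean((embed(crop) - ref) ** 2))
    return loss / n_jitter

# toy embedding: mean-pool to a 1-d feature (location-insensitive on flat regions)
embed = lambda patch: np.array([patch.mean()])
img = np.zeros((32, 32)); img[8:24, 8:24] = 1.0   # person region
print(jitter_invariance_loss(embed, img, (10, 10, 22, 22)))   # → 0.0
```

A geometrically invariant encoder yields zero penalty here; a location-dependent one would be pushed toward invariance by this term.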
https://arxiv.org/abs/2601.15931
Cervical spine fractures are critical medical conditions requiring precise and efficient detection for effective clinical management. This study explores the viability of 2D projection-based vertebra segmentation for vertebra-level fracture detection in 3D CT volumes, presenting an end-to-end pipeline for automated analysis of cervical vertebrae (C1-C7). By approximating a 3D volume through optimized 2D axial, sagittal, and coronal projections, regions of interest are identified using the YOLOv8 model from all views and combined to approximate the 3D cervical spine area, achieving a 3D mIoU of 94.45 percent. This projection-based localization strategy reduces computational complexity compared to traditional 3D segmentation methods while maintaining high performance. It is followed by a DenseNet121-Unet-based multi-label segmentation leveraging variance- and energy-based projections, achieving a Dice score of 87.86 percent. Strategic approximation of 3D vertebral masks from these 2D segmentation masks enables the extraction of individual vertebra volumes. The volumes are analyzed for fractures using an ensemble of 2.5D Spatio-Sequential models incorporating both raw slices and projections per vertebra for complementary evaluation. This ensemble achieves vertebra-level and patient-level F1 scores of 68.15 and 82.26, and ROC-AUC scores of 91.62 and 83.04, respectively. We further validate our approach through an explainability study that provides saliency map visualizations highlighting anatomical regions relevant for diagnosis, and an interobserver variability analysis comparing our model's performance with expert radiologists, demonstrating competitive results.
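The projection step is straightforward to sketch. Only the variance- and energy-based maps named in the abstract are shown; the axis naming and the `project_volume` helper are illustrative assumptions:

```python
import numpy as np

def project_volume(vol):
    """Approximate a 3D CT volume (z, y, x) with 2D projections along each
    axis, here variance- and energy-based, as used for segmentation input."""
    proj = {}
    for name, axis in (("axial", 0), ("coronal", 1), ("sagittal", 2)):
        proj[name] = {
            "variance": vol.var(axis=axis),
            "energy": (vol.astype(float) ** 2).sum(axis=axis),
        }
    return proj

vol = np.random.default_rng(8).normal(size=(32, 64, 64))   # toy CT volume
proj = project_volume(vol)
print(proj["axial"]["variance"].shape, proj["sagittal"]["energy"].shape)
# → (64, 64) (32, 64)
```

Each 2D map then feeds a 2D detector/segmenter (YOLOv8, DenseNet121-Unet), and the per-view results are recombined to approximate the 3D region.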
https://arxiv.org/abs/2601.15235
Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. Prior utterance-level matching methods, from an embedding-learning standpoint, learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding for each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
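A minimal sketch of the PCA-guided prefix alignment, assuming full-dimensional text embeddings are already available; `pca_teacher_targets` and the cosine-based alignment loss are simplified stand-ins for the paper's training objective:

```python
import numpy as np

def pca_teacher_targets(full_embs, prefix_dim):
    """PCA-compress the full embeddings to prefix_dim dimensions; these
    serve as teacher targets for the matching prefix size."""
    X = full_embs - full_embs.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:prefix_dim].T               # (N, prefix_dim)

def prefix_alignment_loss(student_prefix, teacher_targets):
    """Mean cosine distance between student prefixes and PCA teachers."""
    s = student_prefix / np.linalg.norm(student_prefix, axis=1, keepdims=True)
    t = teacher_targets / np.linalg.norm(teacher_targets, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)
full = rng.normal(size=(200, 128))             # toy full-size text embeddings
for d in (16, 32, 64):                         # nested ("matryoshka") prefix sizes
    teacher = pca_teacher_targets(full, d)
    student = full[:, :d]                      # the first-d sub-embedding ("prefix")
    print(d, round(prefix_alignment_loss(student, teacher), 3))
```

Training would pull both audio and text prefixes toward these compressed teachers, so the most salient keyword cues concentrate in the low-dimensional prefixes.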
https://arxiv.org/abs/2601.14012
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip, which reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on the Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines by over 30\% on DocVQA.
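The dual-path gating can be sketched as a weighted combination of two normalized per-token scores. The fusion rule (a fixed convex combination with weight `alpha`) and the min-max normalization are assumptions, not the published formulation:

```python
import numpy as np

def vskip_prune(surprisal, visual_attention, keep_ratio=0.5, alpha=0.5):
    """Dual-path token scoring: combine linguistic surprisal with
    cross-modal attention mass, then keep the top-scoring tokens."""
    def norm(x):  # scale each signal to [0, 1] so neither path dominates
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    score = alpha * norm(surprisal) + (1 - alpha) * norm(visual_attention)
    k = max(1, int(len(score) * keep_ratio))
    return np.sort(np.argsort(-score)[:k])     # retained token indices, in order

surprisal = [0.2, 3.1, 0.1, 2.0, 0.3, 1.5]     # low = linguistically redundant
vis_attn  = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3]    # high = visually anchored
print(vskip_prune(surprisal, vis_attn, keep_ratio=0.5))   # → [0 1 2]
```

Note that tokens 0 and 2 are linguistically redundant yet survive pruning thanks to their visual attention mass, which is exactly the "rescue" behavior the abstract describes.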
https://arxiv.org/abs/2601.13879
Explainable Artificial Intelligence (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), have become indispensable for visualizing the reasoning process of deep neural networks in medical image analysis. Despite their popularity, the faithfulness and reliability of these heatmap-based explanations remain under scrutiny. This study critically investigates whether Grad-CAM truly represents the internal decision-making of deep models trained for lung cancer image classification. Using the publicly available IQ-OTH/NCCD dataset, we evaluate five representative architectures: ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base-Patch16-224, to explore model-dependent variations in Grad-CAM interpretability. We introduce a quantitative evaluation framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to assess Grad-CAM reliability across architectures. Experimental findings reveal that while Grad-CAM effectively highlights salient tumor regions in most convolutional networks, its interpretive fidelity significantly degrades for Vision Transformer models due to non-local attention behavior. Furthermore, cross-model comparisons indicate substantial variability in saliency localization, implying that Grad-CAM explanations may not always correspond to the true diagnostic evidence used by the networks. This work exposes critical limitations of current saliency-based XAI approaches in medical imaging and emphasizes the need for model-aware interpretability methods that are both computationally sound and clinically meaningful. Our findings aim to inspire a more cautious and rigorous adoption of visual explanation tools in medical AI, urging the community to rethink what it truly means to "trust" a model's explanation.
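The perturbation-based faithfulness component of such an evaluation can be sketched as a deletion test: mask the most salient pixels first and measure how fast the model score falls. The step schedule and the toy region-of-interest "model" below are illustrative assumptions:

```python
import numpy as np

def deletion_faithfulness(model, image, saliency, steps=5):
    """Progressively delete the most salient pixels and record the model
    score; a faithful map yields a fast, large average score drop."""
    order = np.argsort(-saliency.ravel())        # most salient first
    base = model(image)
    drops, img = [], image.copy().ravel()
    chunk = len(order) // steps
    for i in range(steps):
        img[order[i * chunk:(i + 1) * chunk]] = 0.0   # cumulative deletion
        drops.append(base - model(img.reshape(image.shape)))
    return float(np.mean(drops))

# toy "model": score = mean intensity of a fixed region of interest
roi = np.zeros((8, 8)); roi[2:6, 2:6] = 1.0
model = lambda x: float((x * roi).sum() / roi.sum())
image = np.ones((8, 8))
good_map = roi                                   # saliency matching the evidence
bad_map = 1.0 - roi                              # saliency missing it entirely
print(deletion_faithfulness(model, image, good_map) >
      deletion_faithfulness(model, image, bad_map))   # → True
```

Localization accuracy and cross-architecture consistency can then be layered on top of this scalar, as the proposed framework does.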
https://arxiv.org/abs/2601.12826
Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.
https://arxiv.org/abs/2601.12820
Explainable AI (XAI) is crucial for building transparent and trustworthy machine learning systems, especially in high-stakes domains. Concept Bottleneck Models (CBMs) have emerged as a promising ante-hoc approach that provides interpretable, concept-level explanations by explicitly modeling human-understandable concepts. However, existing CBMs often suffer from poor locality faithfulness, failing to spatially align concepts with meaningful image regions, which limits their interpretability and reliability. In this work, we propose SL-CBM (CBM with Semantic Locality), a novel extension that enforces locality faithfulness by generating spatially coherent saliency maps at both concept and class levels. SL-CBM integrates a 1x1 convolutional layer with a cross-attention mechanism to enhance alignment between concepts, image regions, and final predictions. Unlike prior methods, SL-CBM produces faithful saliency maps inherently tied to the model's internal reasoning, facilitating more effective debugging and intervention. Extensive experiments on image datasets demonstrate that SL-CBM substantially improves locality faithfulness, explanation quality, and intervention efficacy while maintaining competitive classification accuracy. Our ablation studies highlight the importance of contrastive and entropy-based regularization for balancing accuracy, sparsity, and faithfulness. Overall, SL-CBM bridges the gap between concept-based reasoning and spatial explainability, setting a new standard for interpretable and trustworthy concept-based models.
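The 1x1-convolution view of concept scoring is easy to sketch: a 1x1 conv is a per-pixel linear map, so each concept yields a spatial activation map that is then pooled into a concept score. The mean pooling and the linear class head below are assumptions; the actual SL-CBM adds cross-attention and regularizers on top:

```python
import numpy as np

def concept_maps(features, concept_w):
    """1x1 convolution as a per-pixel linear map: features (H, W, C)
    -> one spatial activation map per concept, shape (K, H, W)."""
    H, W, C = features.shape
    maps = features.reshape(-1, C) @ concept_w.T   # (H*W, K)
    return maps.T.reshape(concept_w.shape[0], H, W)

def class_scores(maps, class_w):
    """Pool each concept map to a scalar concept score, then apply a
    linear class head so predictions stay tied to the concept bottleneck."""
    concept_scores = maps.mean(axis=(1, 2))        # (K,)
    return class_w @ concept_scores, concept_scores

rng = np.random.default_rng(1)
feat = rng.normal(size=(7, 7, 16))                 # backbone feature map
cw = rng.normal(size=(4, 16))                      # 4 concepts
clsw = rng.normal(size=(3, 4))                     # 3 classes
maps = concept_maps(feat, cw)
logits, scores = class_scores(maps, clsw)
print(maps.shape, logits.shape)                    # → (4, 7, 7) (3,)
```

Because each class logit is a linear function of pooled concept maps, the spatial maps themselves serve as the concept-level saliency the abstract describes.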
https://arxiv.org/abs/2601.12804
Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our code is available at this https URL.
https://arxiv.org/abs/2601.12768
Interpretability is essential for user trust in real-world anomaly detection applications. However, deep learning models, despite their strong performance, often lack transparency. In this work, we study the interpretability of autoencoder-based models for audio anomaly detection by comparing a standard autoencoder (AE) with a masked autoencoder (MAE) in terms of detection performance and interpretability. We applied several attribution methods, including error maps, saliency maps, SmoothGrad, Integrated Gradients, GradSHAP, and Grad-CAM. Although the MAE shows slightly lower detection performance, it consistently provides more faithful and temporally precise explanations, suggesting better alignment with true anomalies. To assess the relevance of the regions highlighted by an explanation method, we propose a perturbation-based faithfulness metric that replaces them with their reconstructions to simulate normal input. Our findings, based on experiments in a real industrial scenario, highlight the importance of incorporating interpretability into anomaly detection pipelines and show that masked training improves explanation quality without compromising performance.
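The proposed faithfulness metric can be sketched directly from its description: replace the top-attributed regions with their reconstruction (simulating normal input) and measure the drop in anomaly score. The top-fraction selection and the idealized toy autoencoder below are assumptions:

```python
import numpy as np

def recon_faithfulness(autoencoder, x, explanation, top_frac=0.1):
    """Replace the top-attributed entries with their reconstruction and
    measure how much the anomaly score (reconstruction error) drops."""
    recon = autoencoder(x)
    score = lambda a, b: float(np.mean((a - b) ** 2))
    base = score(x, recon)
    k = max(1, int(x.size * top_frac))
    idx = np.argsort(-explanation.ravel())[:k]
    patched = x.copy().ravel()
    patched[idx] = recon.ravel()[idx]              # "normalize" flagged region
    patched = patched.reshape(x.shape)
    return base - score(patched, autoencoder(patched))

# toy setup: an "ideal" AE that always outputs the normal (all-zero) signal
x = np.zeros((16, 16)); x[4:8, 4:8] = 5.0          # anomalous patch
ae = lambda a: np.zeros_like(a)
good = np.zeros_like(x); good[4:8, 4:8] = 1.0      # explanation on the anomaly
bad = 1.0 - good                                   # explanation missing it
print(recon_faithfulness(ae, x, good, 0.0625) >
      recon_faithfulness(ae, x, bad, 0.0625))      # → True
```

A faithful explanation localizes the anomaly, so patching the highlighted region removes most of the reconstruction error; an unfaithful one leaves the score unchanged.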
https://arxiv.org/abs/2601.12660
Deploying Sentinel-2 satellite-derived bathymetry (SDB) robustly across sites remains challenging. We analyze a Swin-Transformer-based U-Net model (Swin-BathyUNet) to understand how it infers depth and when its predictions are trustworthy. A leave-one-band-out study ranks the spectral importance of the different bands, consistent with shallow-water optics. We adapt ablation-based CAM to regression (A-CAM-R) and validate its reliability via a performance-retention test: keeping only the top-p% salient pixels while neutralizing the rest causes a large, monotonic RMSE increase, indicating that the explanations localize on evidence the model relies on. Attention ablations show that decoder-conditioned cross-attention on skip connections is an effective upgrade, improving robustness to glint and foam. Cross-region inference (train on one site, test on another) reveals depth-dependent degradation: MAE rises nearly linearly with depth, and bimodal depth distributions exacerbate mid- and deep-water errors. Practical guidance follows: maintain wide receptive fields, preserve radiometric fidelity in the green/blue channels, pre-filter bright, high-variance near-shore regions, and pair light target-site fine-tuning with depth-aware calibration to transfer across regions.
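The leave-one-band-out study reduces to a simple ablation loop; the zero-fill ablation and the toy linear depth model below are assumptions (the paper works with a trained Swin-BathyUNet on real Sentinel-2 bands):

```python
import numpy as np

def leave_one_band_out(predict, bands, target):
    """Rank band importance: zero out one spectral band at a time and
    measure the RMSE increase over the all-bands baseline."""
    rmse = lambda p: float(np.sqrt(np.mean((p - target) ** 2)))
    base = rmse(predict(bands))
    deltas = {}
    for b in range(bands.shape[0]):
        ablated = bands.copy()
        ablated[b] = 0.0
        deltas[b] = rmse(predict(ablated)) - base
    return sorted(deltas, key=deltas.get, reverse=True), deltas

# toy depth model: depth driven mostly by band 1, a little by band 2
rng = np.random.default_rng(2)
bands = rng.normal(size=(4, 100))               # 4 bands x 100 pixels
predict = lambda x: 3.0 * x[1] + 0.5 * x[2]
target = predict(bands)
ranking, deltas = leave_one_band_out(predict, bands, target)
print(ranking[0])   # → 1 (the most important band)
```

For actual SDB the expectation, per shallow-water optics, is that the green and blue bands dominate this ranking.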
https://arxiv.org/abs/2601.12636
In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low-quality imaging conditions. A common strategy is to integrate single-image super-resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency-Driven multi-task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs a Swin Transformer-based shared encoder, where hierarchical window-shifted self-attention supports cross-task feature collaboration and adaptively balances the trade-off between texture refinement and semantic representation. In addition, a multi-scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions while suppressing background clutter and the adverse features introduced by multi-task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts: it first stabilizes detection semantics and subsequently routes SR gradients along a detection-oriented direction, guiding the SR branch to generate high-frequency details that are explicitly beneficial for detection. Experiments on public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low-quality remote sensing images. Our code is available at this https URL.
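The gradient routing idea, as described, resembles projecting the SR gradient onto a detection-oriented direction. The projection rule below (keep only the component that also descends the detection loss; drop conflicting gradients) is one plausible reading, not the published algorithm:

```python
import numpy as np

def route_sr_gradient(g_sr, g_det):
    """Keep only the component of the SR gradient along the detection
    gradient direction; zero it out when the two objectives conflict."""
    d = g_det / (np.linalg.norm(g_det) + 1e-12)   # detection-oriented direction
    coef = float(g_sr @ d)                        # signed alignment
    return coef * d if coef > 0 else np.zeros_like(g_sr)

# aligned case: the shared component survives
print(route_sr_gradient(np.array([1.0, 1.0]), np.array([1.0, 0.0])))   # → [1. 0.]
# conflicting case: the SR gradient is dropped entirely
print(route_sr_gradient(np.array([-1.0, 1.0]), np.array([1.0, 0.0])))  # → [0. 0.]
```

Under this reading, the SR branch can only push the shared encoder in directions that also help detection, which matches the stated goal of detection-beneficial high-frequency details.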
https://arxiv.org/abs/2601.12507
Customer reviews contain detailed, domain specific signals about service failures and user expectations, but converting this unstructured feedback into actionable business decisions remains difficult. We study review-to-action generation: producing concrete, implementable recommendations grounded in review text. We propose a modular two-LLM framework in which an Issue model extracts salient issues and assigns coarse themes, and an Advice model generates targeted operational fixes conditioned on the extracted issue representation. To enable specialization without expensive full fine-tuning, we adapt the Advice model using a mixture of LoRA experts strategy: multiple low-rank adapters are trained and a lightweight gating mechanism performs token-level expert mixing at inference, combining complementary expertise across issue types. We construct synthetic review-issue-advice triples from Yelp reviews (airlines and restaurants) to supervise training, and evaluate recommendations using an eight dimension operational rubric spanning actionability, specificity, feasibility, expected impact, novelty, non-redundancy, bias, and clarity. Across both domains, our approach consistently outperforms prompting-only and single-adapter baselines, yielding higher actionability and specificity while retaining favorable efficiency-quality trade-offs.
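Token-level mixing of LoRA experts can be sketched as a softmax gate over per-expert low-rank updates added to a frozen base projection. The gate parameterization (a single linear layer) and dense mixing over all experts are simplifying assumptions:

```python
import numpy as np

def lora_moe_forward(x, W, experts, gate_w):
    """Token-level mixture of LoRA experts: each token receives a
    softmax-weighted mix of low-rank updates (B @ A) on top of the
    frozen base projection W."""
    z = x @ gate_w
    z = z - z.max(axis=-1, keepdims=True)
    gates = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)   # (T, E)
    out = x @ W.T                                 # frozen base projection
    for e, (A, B) in enumerate(experts):          # each expert is a rank-r (A, B) pair
        out += gates[:, e:e + 1] * (x @ A.T @ B.T)  # per-token weighted update
    return out

rng = np.random.default_rng(3)
T, d, r, E = 5, 8, 2, 3                           # tokens, width, rank, experts
x = rng.normal(size=(T, d))
W = rng.normal(size=(d, d))                       # frozen base weight
experts = [(rng.normal(size=(r, d)), rng.normal(size=(d, r))) for _ in range(E)]
gate_w = rng.normal(size=(d, E))                  # lightweight gating layer
print(lora_moe_forward(x, W, experts, gate_w).shape)   # → (5, 8)
```

With all `B` matrices zeroed, the layer reduces exactly to the frozen base, which is the usual LoRA initialization property.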
https://arxiv.org/abs/2601.12338
High-precision scene parsing tasks, including image matting and dichotomous segmentation, aim to accurately predict masks with extremely fine details (such as hair). Most existing methods focus on salient, single foreground objects. While interactive methods allow for target adjustment, their class-agnostic design restricts generalization across different categories. Furthermore, the scarcity of high-quality annotations has led to a reliance on inharmonious synthetic data, resulting in poor generalization to real-world scenarios. To this end, we propose a Foreground Consistent Learning model, dubbed FCLM, to address these issues. Specifically, we first introduce a Depth-Aware Distillation strategy that transfers depth-related knowledge for better foreground representation. Considering the data dilemma, we cast the processing of synthetic data as a domain adaptation problem and propose a domain-invariant learning strategy focused on foreground learning. To support interactive prediction, we contribute an Object-Oriented Decoder that can receive both visual and language prompts to predict the referring target. Experimental results show that our method quantitatively and qualitatively outperforms SOTA methods.
https://arxiv.org/abs/2601.12080
Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.
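The memory-quality gate can be sketched as a confidence-thresholded running update; the scalar confidence, threshold `tau`, and exponential-moving-average update below are simple stand-ins for the paper's decoupled attention-based integration with dynamic quality assessment:

```python
import numpy as np

class QualityAwareMemory:
    """Quality-gated memory sketch: only frames whose confidence exceeds
    a threshold are allowed to update the running target representation,
    preventing occlusion/misclassification noise from accumulating."""
    def __init__(self, dim, tau=0.7, momentum=0.9):
        self.mem = np.zeros(dim)
        self.tau, self.m = tau, momentum
        self.initialized = False

    def update(self, feat, confidence):
        if confidence < self.tau:
            return False                          # filter unreliable frames
        if not self.initialized:
            self.mem, self.initialized = feat.copy(), True   # memory anchoring
        else:
            self.mem = self.m * self.mem + (1 - self.m) * feat
        return True

mem = QualityAwareMemory(4)
mem.update(np.ones(4), confidence=0.95)           # accepted: anchors memory
mem.update(np.full(4, 100.0), confidence=0.2)     # rejected: occlusion noise
print(mem.mem)                                    # → [1. 1. 1. 1.]
```

Without the gate, the second (noisy) frame would corrupt the memory and the error would propagate to every later frame, which is exactly the failure mode MQC-SAM targets.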
https://arxiv.org/abs/2601.12076
Image retrieval is a critical step for alleviating the quadratic complexity of image matching in unconstrained Structure-from-Motion (SfM). In this context, however, retrieval should target image pairs with geometric matchability rather than mere semantic similarity, a nuance that most existing deep learning-based methods, guided by batched binary labels (overlapping vs. non-overlapping pairs), fail to capture. In this paper, we introduce SupScene, a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for SfM. First, to better underline co-visible regions, we employ a subgraph-based training strategy that moves beyond equally weighted, isolated pairs, leveraging ground-truth geometric overlap relationships with varying weights to provide fine-grained supervision via a soft supervised contrastive loss. Second, we introduce DiVLAD, a DINO-inspired VLAD aggregator that leverages the inherent multi-head attention maps from the last block of a ViT. A learnable gating mechanism then adaptively fuses these semantically salient cues with visual features, enabling a more discriminative global descriptor. Extensive experiments on the GL3D dataset demonstrate that our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters. Furthermore, we show that the proposed training strategy brings consistent gains across different aggregation techniques. Code and models are available at this https URL.
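The soft supervised contrastive loss can be sketched by replacing hard positive labels with normalized overlap weights; the temperature value and the row-normalization of the overlap matrix below are assumptions:

```python
import numpy as np

def soft_supcon_loss(desc, overlap, temp=0.1):
    """Soft supervised contrastive loss: positives are weighted by
    ground-truth geometric overlap rather than a hard 0/1 label."""
    d = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = d @ d.T / temp
    np.fill_diagonal(sim, -np.inf)                 # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    w = overlap / np.maximum(overlap.sum(axis=1, keepdims=True), 1e-12)
    logp = np.where(np.isfinite(logp), logp, 0.0)  # zero out the diagonal
    return float(-(w * logp).sum(axis=1).mean())

rng = np.random.default_rng(4)
desc = rng.normal(size=(4, 8))                     # global descriptors
overlap = np.array([[0.0, 0.8, 0.1, 0.0],          # ground-truth overlap weights
                    [0.8, 0.0, 0.2, 0.0],
                    [0.1, 0.2, 0.0, 0.5],
                    [0.0, 0.0, 0.5, 0.0]])
print(round(soft_supcon_loss(desc, overlap), 3))
```

Pairs with larger geometric overlap contribute more to the loss, so the descriptor space is shaped by degrees of co-visibility rather than a binary overlapping/non-overlapping split.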
https://arxiv.org/abs/2601.11930
Convolutional neural networks (CNNs) have achieved state-of-the-art performance in image recognition tasks but often involve complex architectures that may overfit on small datasets. In this study, we evaluate a compact CNN across five publicly available, real-world image datasets from Bangladesh, including urban encroachment, vehicle detection, road damage, and agricultural crops. The network demonstrates high classification accuracy, efficient convergence, and low computational overhead. Quantitative metrics and saliency analyses indicate that the model effectively captures discriminative features and generalizes robustly across diverse scenarios, highlighting the suitability of streamlined CNN architectures for small-class image classification tasks.
https://arxiv.org/abs/2601.11911
We present XChoice, an explainable framework for evaluating AI-human alignment in constrained decision making. Moving beyond outcome-agreement metrics such as accuracy and F1 score, XChoice fits a mechanism-based decision model to human data and LLM-generated decisions, recovering interpretable parameters that capture the relative importance of decision factors, constraint sensitivity, and implied trade-offs. Alignment is assessed by comparing these parameter vectors across models, options, and subgroups. We demonstrate XChoice on Americans' daily time allocation using the American Time Use Survey (ATUS) as human ground truth, revealing heterogeneous alignment across models and activities, with salient misalignment concentrated in Black and married subgroups. We further validate the robustness of XChoice via an invariance analysis and evaluate targeted mitigation with a retrieval-augmented generation (RAG) intervention. Overall, XChoice provides mechanism-based metrics that diagnose misalignment and support informed improvements beyond surface outcome matching.
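A toy version of the mechanism-based comparison: fit interpretable factor weights to observed decisions, then score alignment as the similarity of the recovered parameter vectors. The linear-utility model and least-squares fit are stand-ins for the paper's constrained decision model:

```python
import numpy as np

def fit_decision_weights(factors, decisions):
    """Recover interpretable factor weights from observed decisions via
    least squares (a stand-in for the mechanism-based model fit)."""
    w, *_ = np.linalg.lstsq(factors, decisions, rcond=None)
    return w

def alignment(w_human, w_model):
    """Alignment = cosine similarity of the recovered parameter vectors."""
    return float(w_human @ w_model /
                 (np.linalg.norm(w_human) * np.linalg.norm(w_model)))

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))                      # decision factors
w_true = np.array([2.0, -1.0, 0.5])               # human trade-offs
y_human = X @ w_true + 0.01 * rng.normal(size=50)
y_llm = X @ np.array([2.1, -0.9, 0.4])            # LLM-implied trade-offs
a = alignment(fit_decision_weights(X, y_human), fit_decision_weights(X, y_llm))
print(a > 0.99)   # → True
```

Comparing parameter vectors this way diagnoses *why* decisions diverge (which trade-offs differ), not merely *whether* outcomes agree.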
https://arxiv.org/abs/2601.11286
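XChoice's core move is to fit the same interpretable decision model to human data and to LLM-generated decisions, then compare the recovered parameter vectors rather than raw outcomes. As a hedged sketch of that idea (the paper's mechanism model is richer), the toy below fits a two-factor linear utility by ordinary least squares to each decision source and scores alignment as the cosine similarity of the fitted weights; all data and factor names are synthetic.

```python
# Toy version of mechanism-based alignment: fit one decision model to
# human choices and to LLM choices, then compare fitted parameters.
# The 2-factor linear utility and all data here are assumptions.
import math

def fit_2factor(X, y):
    """OLS for y ~ w1*x1 + w2*x2 (no intercept) via 2x2 normal equations."""
    s11 = sum(x[0] * x[0] for x in X)
    s12 = sum(x[0] * x[1] for x in X)
    s22 = sum(x[1] * x[1] for x in X)
    t1 = sum(x[0] * yi for x, yi in zip(X, y))
    t2 = sum(x[1] * yi for x, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

def cosine(u, v):
    """Cosine similarity between two parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Synthetic data: time allocated to an activity given two decision factors.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
human_y = [2.0, 1.0, 3.0, 5.0]   # consistent with weights (2, 1)
llm_y = [1.0, 2.0, 3.0, 4.0]     # consistent with weights (1, 2)

w_human = fit_2factor(X, human_y)
w_llm = fit_2factor(X, llm_y)
alignment = cosine(w_human, w_llm)  # 1.0 would mean identical trade-offs
```

Even though both sources "agree" on some observations (e.g. the third one), the fitted weights reveal opposite factor priorities, which is exactly the misalignment that outcome-level metrics can miss.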
No-reference video quality assessment (NR-VQA) estimates perceptual quality without a reference video, which is often challenging. While recent techniques leverage saliency or transformer attention, they address the global context of the video signal only through static maps supplied as auxiliary inputs, rather than embedding context fundamentally within the feature extraction of the video sequence. We present Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA), the first framework to integrate register tokens directly into a convolutional backbone for spatio-temporal, dynamic saliency prediction. By embedding learnable register tokens as global context carriers, our model enables dynamic, HVS-inspired attention, producing temporally adaptive saliency maps that track salient regions over time without explicit motion estimation. The model fuses these dynamic saliency maps with RGB inputs and analyzes the resulting spatial features through a temporal transformer to deliver perceptually consistent quality predictions. Comprehensive tests on the LSVQ, KoNViD-1k, LIVE-VQC, and YouTube-UGC datasets show highly competitive performance, surpassing the majority of strong baselines. Ablation studies demonstrate that the register tokens promote stable, temporally consistent attention. Running at 387.7 FPS on 1080p video, DAGR-VQA is fast enough for real-time applications such as multimedia streaming systems.
https://arxiv.org/abs/2601.11045
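The mechanism behind register tokens can be sketched independently of DAGR-VQA's backbone: a few extra learnable embeddings are appended to the patch sequence, so every patch can read from, and write to, shared global slots during self-attention. The minimal single-head attention below (identity Q/K/V projections, toy feature values) is an assumption-laden illustration of that idea, not the paper's implementation.

```python
# Minimal sketch of register tokens as global context carriers.
# Appending register embeddings to the patch sequence lets attention
# route global information through shared slots. All values are toy.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(tokens):
    """Single-head self-attention with identity Q/K/V projections."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in tokens]
        w = softmax(scores)  # positive weights summing to 1
        out.append([sum(wi * v[d] for wi, v in zip(w, tokens))
                    for d in range(len(q))])
    return out

patches = [[0.5, 0.1], [0.2, 0.4], [0.9, 0.3]]  # toy patch features
registers = [[1.0, 1.0]]                         # one learnable register token

mixed = attention(patches + registers)
patch_out, register_out = mixed[:len(patches)], mixed[len(patches):]
# register_out aggregates information from every patch, while each
# patch output now also reads from the shared global slot.
```

Because each output is a convex combination of all tokens, the register's output summarizes the whole frame, and every patch sees that summary, without any explicit motion estimation or external saliency map.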
Post-training quantization (PTQ) methods for large language models rely on heuristics that implicitly estimate which weight channels most strongly influence model behavior. Two dominant paradigms have emerged: activation-aware methods such as AWQ prioritize channels with large activation magnitudes, while second-order methods such as GPTQ allocate quantization error according to input covariance structure. Despite strong empirical performance, these approaches remain conceptually fragmented, and it is unclear what underlying quantity they are approximating. In this work, we present a unified theoretical framework for PTQ by formalizing activation sensitivity, defined as the expected impact of channel-wise perturbations on the loss. Using a first-order Taylor expansion, we show that sensitivity naturally arises as the squared norm of gradient-weighted activations, yielding a principled measure of channel importance that captures both activation magnitude and downstream error propagation. Within this framework, AWQ and GPTQ can be interpreted as complementary approximations that recover sensitivity under distinct simplifying assumptions. We analyze the design space of sensitivity metrics, connect gradient-based saliency, Fisher information, and Hessian-based criteria, and clarify their relationships to classical pruning methods such as Optimal Brain Damage and Optimal Brain Surgeon. Rather than proposing a new quantization algorithm, this work provides a conceptual foundation for understanding and comparing post-training quantization methods through the lens of sensitivity.
https://arxiv.org/abs/2601.11663
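The unifying quantity in the abstract above, channel sensitivity as the expected squared gradient-weighted activation, can be computed side by side with the simpler proxies that AWQ-style (activation magnitude) and GPTQ-style (input covariance diagonal) heuristics use. The sketch below uses synthetic activations and gradients to show how the three rankings can diverge; it is an illustration of the framework's claim, not the paper's code.

```python
# Per-channel sensitivity E[(g_c * a_c)^2] vs. the AWQ-style proxy
# E[|a_c|] and the GPTQ-style proxy E[a_c^2]. All data is synthetic.

# activations[i][c] and grads[i][c]: sample i, channel c
activations = [[2.0, 0.1, -1.0], [1.5, 0.2, -0.8], [2.5, 0.0, -1.2]]
grads = [[0.1, 3.0, 0.5], [0.2, 2.5, 0.4], [0.1, 3.5, 0.6]]

n, C = len(activations), len(activations[0])

# Sensitivity: a channel matters when activations are large AND the
# loss gradient with respect to them is large.
sensitivity = [sum((grads[i][c] * activations[i][c]) ** 2
                   for i in range(n)) / n for c in range(C)]

# AWQ-style proxy: mean absolute activation magnitude per channel.
awq_proxy = [sum(abs(activations[i][c]) for i in range(n)) / n
             for c in range(C)]

# GPTQ-style proxy: diagonal of the input covariance, E[a_c^2].
gptq_proxy = [sum(activations[i][c] ** 2 for i in range(n)) / n
              for c in range(C)]
```

In this toy data, channel 0 has the largest activations, so both proxies rank it first, yet its gradients are tiny and the gradient-weighted sensitivity instead ranks channel 2 highest. This is the kind of divergence the paper's framework makes explicit: both heuristics recover the true sensitivity only under their respective simplifying assumptions.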