Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface-form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach in which a text's surface form is transformed by altering syntax and vocabulary while preserving semantic content. The transformations substantially modify structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores. This isolates the effect of semantic information: models perform almost as well as on the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models, and find that image-based transformations add substantial noise, reducing classification accuracy. Our methodology provides a novel way of examining which features influence model predictions and allows the removal of possible spurious correlations. We find that, using semantic information alone, language-model-based classifiers can still detect AD. This work shows that difficult-to-detect semantic impairment can be identified, addressing an overlooked aspect of linguistic deterioration and opening new pathways for early detection systems.
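To make the transformation check concrete, here is a minimal sketch (not the authors' code) of how one could verify that a rewritten description discards the surface form while keeping the meaning; the sacrebleu and sentence-transformers libraries and the MiniLM encoder are illustrative choices, not the paper's stated tooling.

```python
import sacrebleu
from sentence_transformers import SentenceTransformer, util

def transformation_scores(original: str, transformed: str) -> dict:
    # Surface overlap: low scores indicate heavy syntactic/lexical change.
    bleu = sacrebleu.sentence_bleu(transformed, [original]).score
    chrf = sacrebleu.sentence_chrf(transformed, [original]).score
    # Semantic preservation: cosine similarity of sentence embeddings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
    emb = encoder.encode([original, transformed], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return {"bleu": bleu, "chrf": chrf, "semantic_similarity": semantic}

print(transformation_scores(
    "The boy is reaching for the cookie jar on the top shelf.",
    "Up on the highest shelf sits a jar of cookies, and a child stretches toward it.",
))  # expect low bleu/chrf but high semantic_similarity
```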
https://arxiv.org/abs/2512.13685
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
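The abstract does not give the exact reward formulas, so the following is a purely illustrative sketch of how a two-part perception/behavior reward of the kind described could be composed; the weights, function signatures, and the IoU term for spatial alignment are all assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def perception_reward(pred_label, gold_label, pred_box, gold_box, pred_type, gold_type):
    """Classification accuracy + spatial alignment + defect-type correctness."""
    cls = 1.0 if pred_label == gold_label else 0.0
    iou = box_iou(pred_box, gold_box) if gold_box is not None else 0.0
    typ = 1.0 if pred_type == gold_type else 0.0
    return 0.5 * cls + 0.3 * iou + 0.2 * typ  # hypothetical weighting

def behavior_reward(tool_calls, budget=4):
    """Full reward within the tool-call budget, decaying linearly beyond it."""
    return max(0.0, 1.0 - max(0, len(tool_calls) - budget) / budget)
```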
https://arxiv.org/abs/2512.13671
Large language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task for studying LLM memory accuracy because it presents significant challenges: extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, including parameter-efficient fine-tuning and auto-modeling, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.
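As one concrete instance of the parameter-efficient fine-tuning mentioned above, a LoRA setup for the 15-topic task might look like the sketch below; the backbone, rank, and target modules are assumptions, not the paper's reported configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "bert-base-uncased"  # stand-in backbone; the paper compares several
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=15)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```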
https://arxiv.org/abs/2512.13654
Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening. We therefore developed and evaluated a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assessed the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95% CI: 0.76–0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95% CI: 0.70–0.76, p<.001) and DenseNet-121 (0.74, 95% CI: 0.71–0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95% CI: 0.76–0.80) compared to DINOv2's 0.76 (95% CI: 0.74–0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95% CI: 0.60–0.74) compared to DBT-DINO's 0.62 (95% CI: 0.53–0.71, p=.60). DBT-DINO performed better on cancerous lesions specifically, with a detection rate of 78.8% compared to DINOv2's 77.3%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with the ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.
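Confidence intervals of the kind reported above are typically obtained with a percentile bootstrap over the test set; the sketch below shows one standard way to compute such an interval for AUROC (the resample count and seed are arbitrary choices, not the paper's).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap CI; y_true/y_score are 1D arrays."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:  # need both classes in the resample
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)
```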
https://arxiv.org/abs/2512.13608
Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect the presence of inter-individual variability and negative transfer phenomena during model training. To address this issue, this paper introduces a cross-subject EEG-based emotion recognition method built on source selection with an adversarial strategy. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS receives the source-domain selection results and the pretrained domain discriminators from the SS. The pretrained domain discriminators compute a novel loss aimed at enhancing the performance of domain classification during adversarial training, ensuring the balance of the adversarial strategies. This paper provides theoretical insights into the proposed method and achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at this https URL.
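Adversarial domain training of this kind is typically built on a gradient-reversal layer; below is that standard construction as a reference point (the paper's SS/AS modules add source selection and a discriminator-balancing loss on top of this idea, which are not reproduced here).

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the backward
        # pass: the feature extractor is pushed to *confuse* the domain discriminator.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```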
https://arxiv.org/abs/2512.13458
Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.
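A minimal sketch of the test-time mapping idea using an off-the-shelf image-to-image diffusion pipeline: a shifted target image is pushed back toward a described source domain before being fed to the frozen task model. The checkpoint, prompt, and strength value are illustrative stand-ins, not the paper's configuration.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

night_image = Image.open("night_scene.png").convert("RGB")  # target-domain input
source_prompt = "a clear daytime street scene"              # source-domain description

restored = pipe(
    prompt=source_prompt,
    image=night_image,
    strength=0.4,        # small enough to preserve the scene content
    guidance_scale=7.5,
).images[0]
# `restored` is then fed to the unchanged downstream segmentation/detection model.
```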
https://arxiv.org/abs/2512.13454
As the therapeutic target for Inflammatory Bowel Disease (IBD) shifts toward histologic remission, the accurate assessment of microscopic inflammation has become increasingly central for evaluating disease activity and response to treatment. In this work, we introduce IMILIA (Interpretable Multiple Instance Learning for Inflammation Analysis), an end-to-end framework designed for the prediction of inflammation presence in IBD digitized slides stained with hematoxylin and eosin (H&E), followed by the automated computation of markers characterizing the tissue regions driving the predictions. IMILIA is composed of an inflammation prediction module, consisting of a Multiple Instance Learning (MIL) model, and an interpretability module, divided into two blocks: HistoPLUS, for cell instance detection, segmentation, and classification; and EpiSeg, for epithelium segmentation. IMILIA achieves a cross-validation ROC-AUC of 0.83 on the discovery cohort, and ROC-AUCs of 0.99 and 0.84 on two external validation cohorts. The interpretability module yields biologically consistent insights: tiles with higher predicted scores show increased densities of immune cells (lymphocytes, plasmocytes, neutrophils and eosinophils), whereas lower-scored tiles predominantly contain normal epithelial cells. Notably, these patterns were consistent across all datasets. Code and models to partially replicate the results on the public IBDColEpi dataset can be found at this https URL.
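The MIL prediction module can be illustrated with a generic attention-pooling head over tile embeddings, where the per-tile attention weights provide the tile scores the interpretability analysis builds on; dimensions are placeholders and this is a generic sketch, not the released model.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=768, hid_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, 1)
        )
        self.classifier = nn.Linear(in_dim, 1)  # slide-level inflammation logit

    def forward(self, tiles):                 # tiles: (n_tiles, in_dim)
        scores = self.attn(tiles)             # (n_tiles, 1), interpretable per tile
        weights = torch.softmax(scores, dim=0)
        slide_emb = (weights * tiles).sum(0)  # attention-weighted tile average
        return self.classifier(slide_emb), weights.squeeze(-1)
```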
https://arxiv.org/abs/2512.13440
Prenatal ultrasound is the cornerstone for detecting congenital anomalies of the kidneys and urinary tract, but diagnosis is limited by operator dependence and suboptimal imaging conditions. We sought to assess the performance of a self-supervised ultrasound foundation model for automated fetal renal anomaly classification using a curated dataset of 969 two-dimensional ultrasound images. A pretrained Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE) was fine-tuned for binary and multi-class classification of normal kidneys, urinary tract dilation, and multicystic dysplastic kidney. Models were compared with a DenseNet-169 convolutional baseline using cross-validation and an independent test set. USF-MAE consistently improved upon the baseline across all evaluation metrics in both binary and multi-class settings, with gains of about 1.87% (AUC) and 7.8% (F1-score) on the validation set, and 2.32% (AUC) and 4.33% (F1-score) on the independent holdout test set. The largest gains were observed in the multi-class setting, where AUC improved by 16.28% and F1-score by 46.15%. To facilitate model interpretability, Score-CAM visualizations were adapted for a transformer architecture and showed that model predictions were informed by known, clinically relevant renal structures, including the renal pelvis in urinary tract dilation and cystic regions in multicystic dysplastic kidney. These results show that ultrasound-specific self-supervised learning can generate a useful representation as a foundation for downstream diagnostic tasks. The proposed framework offers a robust, interpretable approach to support the prenatal detection of renal anomalies and demonstrates the promise of foundation models in obstetric imaging.
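Adapting Score-CAM to a transformer, as mentioned above, amounts to treating patch-token activations as spatial maps; the sketch below shows one plausible way to do this. The `get_patch_tokens` accessor, the 14x14 grid, and the channel subsetting are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def scorecam_vit(model, get_patch_tokens, image, target_class, grid=14, topk=64):
    tokens = get_patch_tokens(model, image)     # (n_patches, dim), CLS token removed
    maps = tokens.T.reshape(-1, grid, grid)     # one map per embedding channel
    # Keep only the most active channels to bound the number of forward passes.
    idx = maps.flatten(1).amax(1).topk(topk).indices
    cam = torch.zeros(image.shape[-2:])
    for m in maps[idx]:
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)
        mask = F.interpolate(m[None, None], image.shape[-2:], mode="bilinear")[0]
        score = model(image * mask).softmax(-1)[0, target_class]
        cam += score * mask[0]                  # class score weights each masked map
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```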
https://arxiv.org/abs/2512.13434
Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges, we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representations. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1- to 15-shot scenarios, reaching 98.23±0.33% at 15-shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28±1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and an inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.
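A sketch of a fusion-plus-Bi-LSTM classifier in the spirit of the framework above. The torchvision backbones supply the stated channel sizes (1280 + 960 = 2240), while splitting the fused vector into a 16-step sequence for the Bi-LSTM is an assumption made here for concreteness.

```python
import torch
import torch.nn as nn
from torchvision import models

class FusedFewShotNet(nn.Module):
    def __init__(self, n_classes=10, chunks=16):
        super().__init__()
        self.f2 = models.mobilenet_v2(weights="DEFAULT").features        # -> 1280 ch
        self.f3 = models.mobilenet_v3_large(weights="DEFAULT").features  # ->  960 ch
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.chunks = chunks                       # 2240 / 16 = 140 features per step
        self.lstm = nn.LSTM(2240 // chunks, 128, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(256, 1)              # attention over sequence steps
        self.head = nn.Linear(256, n_classes)

    def forward(self, x):
        f = torch.cat([self.pool(self.f2(x)), self.pool(self.f3(x))], 1).flatten(1)
        seq = f.view(x.size(0), self.chunks, -1)   # (B, 16, 140) fused "sequence"
        h, _ = self.lstm(seq)                      # (B, 16, 256)
        w = torch.softmax(self.attn(h), dim=1)     # focus on most relevant steps
        return self.head((w * h).sum(1))
```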
https://arxiv.org/abs/2512.13428
Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification, and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release the model weights, tokenizer, and source code used for data processing and model training.
https://arxiv.org/abs/2512.13298
This paper describes an automatic bird call recording system called SAMAY, developed to study bird species by creating a database of large amounts of bird acoustic data. By analysing the recorded bird call data, the system can also be used for automatic classification of bird species, monitoring bird populations, and analysing the impact of environmental changes. The system is driven by a powerful STM32F407-series microcontroller, supports 4 microphones, is equipped with 128 GB of storage capacity, and is powered by a 10400 mAh battery pack interfaced with a solar charger. In addition, the device is user-configurable over USB and Wi-Fi during runtime, ensuring user-friendly operation during field deployment.
https://arxiv.org/abs/2512.13284
Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: (i) Immersive ASMR video-audio sources. Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. (ii) Peer-review evaluation. An adversarial creator-reviewer protocol in which video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random is 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at this https URL.
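At evaluation time, the creator-reviewer protocol reduces to measuring a reviewer's real-vs-fake accuracy over a shuffled pool of clips; the sketch below illustrates that loop, with `reviewer_judge` as a placeholder for any VLM call rather than a real API.

```python
import random

def peer_review_accuracy(real_clips, generated_clips, reviewer_judge):
    """Accuracy near 0.5 means the creator reliably fools the reviewer."""
    items = [(c, "real") for c in real_clips] + [(c, "fake") for c in generated_clips]
    random.shuffle(items)
    correct = sum(1 for clip, label in items if reviewer_judge(clip) == label)
    return correct / len(items)
```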
https://arxiv.org/abs/2512.13281
In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.
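The positive/negative construction described above can be written down as a small InfoNCE-style loss over the masked nodes; the temperature and the cosine-similarity choice are assumptions, not stated in the abstract.

```python
import torch
import torch.nn.functional as F

def core_contrastive_loss(recon, orig, tau=0.2):
    """recon, orig: (n_masked, dim) reconstructed / original masked-node features."""
    recon = F.normalize(recon, dim=-1)
    orig = F.normalize(orig, dim=-1)
    logits = recon @ orig.t() / tau                             # pairwise similarities
    targets = torch.arange(recon.size(0), device=recon.device)  # positives on diagonal
    # Each reconstructed feature is pulled toward its own original feature,
    # with the other masked nodes serving as negatives.
    return F.cross_entropy(logits, targets)
```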
https://arxiv.org/abs/2512.13235
Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data-free knowledge distillation (DFKD) methods, primarily designed for classification, often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data-free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real-world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution, without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data-free distillation methods for semantic segmentation, achieving state-of-the-art results with significantly reduced reliance on auxiliary data.
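The BN-statistics guidance behind ADS can be illustrated as follows: score candidate data by how far its per-layer activation statistics drift from the teacher's stored BN running statistics, with lower scores suggesting data closer to the original training distribution. The hook placement and the distance form are illustrative, not the paper's exact criterion.

```python
import torch
import torch.nn as nn

def bn_distribution_score(teacher: nn.Module, batch: torch.Tensor) -> float:
    dists, hooks = [], []

    def make_hook(bn):
        def hook(_, inputs, __):
            x = inputs[0]
            mean = x.mean(dim=(0, 2, 3))                 # batch statistics at this layer
            var = x.var(dim=(0, 2, 3), unbiased=False)
            dists.append(torch.norm(mean - bn.running_mean) +
                         torch.norm(var - bn.running_var))
        return hook

    for m in teacher.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(make_hook(m)))
    with torch.no_grad():
        teacher(batch)
    for h in hooks:
        h.remove()
    return torch.stack(dists).sum().item()  # lower = closer to training distribution
```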
https://arxiv.org/abs/2512.13175
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
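Conditioning generation on tissue structure, as in the ControlNet coupling above, follows the generic diffusers pattern sketched below; the checkpoints and prompt are placeholders (the actual CRAFTS weights are not assumed to be available here).

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16  # generic seg-conditioned stand-in
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

mask = Image.open("nuclear_segmentation_mask.png")  # structural condition input
image = pipe(
    "H&E stained colon adenocarcinoma, moderately differentiated",  # illustrative prompt
    image=mask, num_inference_steps=30,
).images[0]
```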
https://arxiv.org/abs/2512.13164
Deep learning models in medical imaging are susceptible to shortcut learning, relying on confounding metadata (e.g., scanner model) that is often encoded in image embeddings. The crucial question is whether the model actively utilizes this encoded information for its final prediction. We introduce Weight Space Correlation Analysis, an interpretable methodology that quantifies feature utilization by measuring the alignment between the classification heads of a primary clinical task and auxiliary metadata tasks. We first validate our method by successfully detecting artificially induced shortcut learning. We then apply it to probe the feature utilization of an SA-SonoNet model trained for Spontaneous Preterm Birth (sPTB) prediction. Our analysis confirmed that while the embeddings contain substantial metadata, the sPTB classifier's weight vectors were highly correlated with clinically relevant factors (e.g., birth weight) but decoupled from clinically irrelevant acquisition factors (e.g., scanner). Our methodology provides a tool to verify model trustworthiness, demonstrating that, in the absence of induced bias, the clinical model selectively utilizes features related to the genuine clinical signal.
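The core measurement reduces to comparing head weight vectors; a small numpy sketch follows, where the cosine form and the hypothetical head names (`w_sptb`, `w_scanner`, `w_bw`) are assumptions for illustration.

```python
import numpy as np

def head_alignment(w_clinical: np.ndarray, metadata_heads: dict) -> dict:
    """Cosine alignment between a clinical head and auxiliary metadata heads
    trained on the same frozen embeddings; high alignment with a metadata head
    would flag potential shortcut use."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return {name: cos(w_clinical, w) for name, w in metadata_heads.items()}

# e.g. head_alignment(w_sptb, {"scanner": w_scanner, "birth_weight": w_bw})
```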
https://arxiv.org/abs/2512.13144
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations. Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
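The iterative severity-regression loop can be summarized schematically as below; `grade`, `saliency`, and `inpaint` stand in for the trained grading model, the refined lesion saliency map, and the inpainting module, and the threshold and stopping rule are assumptions made for illustration.

```python
def severity_regression(image, grade, saliency, inpaint, max_iters=10):
    trajectory = [image]
    for _ in range(max_iters):
        severity = grade(image)
        if severity == 0:               # reached a "healthy" appearance
            break
        mask = saliency(image) > 0.5    # most lesion-like regions this round
        image = inpaint(image, mask)    # remove pathology, preserve anatomy
        trajectory.append(image)
    return trajectory                   # interpretable disease-to-healthy sequence
```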
https://arxiv.org/abs/2512.13008
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Previous work on adversarial fine-tuning largely focuses on matching the predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and unreliable over-confidence. This overlooked phenomenon highlights a critical reliability gap beyond robustness. To bridge this gap, we propose a novel adversarial fine-tuning objective for CLIP that considers both prediction accuracy and uncertainty alignment. By reparameterizing the output of CLIP as the concentration parameter of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and the magnitude of predictive confidence. Our objective aligns these distributions holistically under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments on multiple zero-shot classification benchmarks demonstrate that our approach effectively restores calibrated uncertainty and achieves competitive adversarial robustness while maintaining clean accuracy.
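The Dirichlet reparameterization can be sketched as follows: logits are mapped to concentration parameters, and the clean and adversarial predictive distributions are aligned as whole Dirichlet distributions rather than as single logits. The softplus link and the symmetric-KL alignment are assumptions, since the abstract does not fix them.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Dirichlet, kl_divergence

def dirichlet_alignment_loss(clean_logits, adv_logits):
    """clean_logits, adv_logits: (batch, n_classes) CLIP similarity logits."""
    alpha_c = F.softplus(clean_logits) + 1.0  # concentrations > 1 keep densities finite
    alpha_a = F.softplus(adv_logits) + 1.0
    p, q = Dirichlet(alpha_c), Dirichlet(alpha_a)
    # Symmetric KL aligns both the relative class structure and the overall
    # confidence magnitude under perturbation.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p)).mean()
```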
https://arxiv.org/abs/2512.12997
We introduce the Continuous Edit Distance (CED), a geodesic and elastic distance for time-varying persistence diagrams (TVPDs). The CED extends edit-distance ideas to TVPDs by combining local substitution costs with penalized deletions/insertions, controlled by two parameters: α (trade-off between temporal misalignment and diagram discrepancy) and β (gap penalty). We also provide an explicit construction of CED geodesics. Building on these ingredients, we present two practical barycenter solvers, one stochastic and one greedy, that monotonically decrease the CED Fréchet energy. Empirically, the CED is robust to additive perturbations (both temporal and spatial), recovers temporal shifts, and supports temporal pattern search. On real-life datasets, the CED achieves clustering performance comparable to or better than standard elastic dissimilarities, while our clustering based on CED barycenters yields superior classification results. Overall, the CED equips TVPD analysis with a principled distance, interpretable geodesics, and practical barycenters, enabling alignment, comparison, averaging, and clustering directly in the space of TVPDs. A C++ implementation is provided for reproducibility at this https URL.
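The abstract does not state the full recurrence, but an edit-distance dynamic program consistent with the description, in which the substitution cost trades temporal misalignment against diagram discrepancy via α while gaps pay β, looks like the sketch below; the exact substitution cost used by the CED is richer than this illustration, and `diag_dist` stands in for any diagram metric (e.g., a Wasserstein distance).

```python
import numpy as np

def ced_like(times_a, diags_a, times_b, diags_b, diag_dist, alpha, beta):
    n, m = len(times_a), len(times_b)
    C = np.full((n + 1, m + 1), np.inf)
    C[0, :] = beta * np.arange(m + 1)   # pure insertions
    C[:, 0] = beta * np.arange(n + 1)   # pure deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = (alpha * abs(times_a[i - 1] - times_b[j - 1])
                   + (1 - alpha) * diag_dist(diags_a[i - 1], diags_b[j - 1]))
            C[i, j] = min(C[i - 1, j - 1] + sub,   # substitute
                          C[i - 1, j] + beta,      # delete from sequence a
                          C[i, j - 1] + beta)      # insert from sequence b
    return C[n, m]
```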
https://arxiv.org/abs/2512.12939
Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt the DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision for the other view. However, large pre-trained models have a preference for specific visual patterns, resulting in the encoding of spurious correlations for unlabeled data and the generation of noisy pseudo-labels. To address this issue, we propose a novel method containing two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the worst-case loss sharpness of the model, which suppresses the encoding of trivial features, thereby reducing overfitting to noisy samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during model training and assigns hard pseudo-labels to them, which not only alleviates the confidence difference between known and unknown classes but also enables the model to quickly learn a more accurate feature distribution for the unknown classes, thus further improving clustering accuracy. Extensive experiments demonstrate that the proposed method can effectively mitigate the noise of pseudo-labels and achieves state-of-the-art results on multiple GCD benchmarks.
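Minimizing worst-case loss sharpness, as LSP does, is in the family of sharpness-aware minimization (SAM); the sketch below shows the standard two-pass SAM step as a reference point, with the radius rho and the update details being illustrative rather than the paper's exact procedure.

```python
import torch

def sharpness_aware_step(model, loss_fn, batch, optimizer, rho=0.05):
    # First pass: gradient at the current weights.
    loss = loss_fn(model, batch)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    # Ascend to the (approximate) worst-case point within the rho-ball.
    with torch.no_grad():
        for p, g in zip((p for p in model.parameters() if p.grad is not None), grads):
            p.add_(rho * g / (norm + 1e-12))
    optimizer.zero_grad()
    # Second pass: sharpness-aware gradient, then restore weights and step.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, g in zip((p for p in model.parameters() if p.grad is not None), grads):
            p.sub_(rho * g / (norm + 1e-12))
    optimizer.step()
    optimizer.zero_grad()
```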
https://arxiv.org/abs/2512.12925