Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models are currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting natural variation in participant skill. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.
https://arxiv.org/abs/2602.09927
Recent advances in imaging technologies, deep learning, and numerical performance have enabled non-invasive, detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet it remains challenging due to potentially complex scene content and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.
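One plausible form of such a joint objective, sketched here with assumed notation ($I$ the observed image, $G_\theta$ the pretrained generative prior with latent $z$, $c$ the crack component, $\lambda_i$ weighting terms), is:

```latex
\min_{z,\,c}\;
\underbrace{\bigl\lVert I - \bigl(G_\theta(z) + c\bigr)\bigr\rVert_2^2}_{\text{data fidelity}}
\;+\; \lambda_1\,\mathrm{MS}(c)
\;+\; \lambda_2\,R(c)
```

where $\mathrm{MS}(\cdot)$ denotes the Mumford--Shah-type functional and $R(\cdot)$ the crack prior; the pixel-level crack map is then read off from the optimized $c$.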
https://arxiv.org/abs/2602.09730
Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multiparametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employ a patch-based method that allows visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multi-modality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. GitHub link: this https URL
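A minimal sketch of how a patch-based classifier can be turned into a fibrosis-stage visualization, assuming a trained patch classifier `model`; the patch size, stride, and output layout below are illustrative placeholders, not the challenge submission's actual settings:

```python
import torch

def stage_heatmap(mri_slice: torch.Tensor, model: torch.nn.Module,
                  patch: int = 32, stride: int = 32) -> torch.Tensor:
    """mri_slice: (C, H, W) multiparametric slice -> coarse map of predicted stages."""
    C, H, W = mri_slice.shape
    rows, cols = (H - patch) // stride + 1, (W - patch) // stride + 1
    heat = torch.zeros(rows, cols, dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                p = mri_slice[:, i*stride:i*stride+patch, j*stride:j*stride+patch]
                logits = model(p.unsqueeze(0))   # (1, num_stages)
                heat[i, j] = logits.argmax()     # predicted fibrosis stage for this patch
    return heat
```

Overlaying `heat` (upsampled) on the segmented liver gives the kind of stage visualization the abstract describes.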
https://arxiv.org/abs/2602.09686
Deep learning has achieved expert-level performance in automated electrocardiogram (ECG) diagnosis, yet the "black-box" nature of these models hinders their clinical deployment. Trust in medical AI requires not just high accuracy but also transparency regarding the specific physiological features driving predictions. Existing explainability methods for ECGs typically rely on post-hoc approximations (e.g., Grad-CAM and SHAP), which can be unstable, computationally expensive, and unfaithful to the model's actual decision-making process. In this work, we propose the ECG-IMN, an Interpretable Mesomorphic Neural Network tailored for high-resolution 12-lead ECG classification. Unlike standard classifiers, the ECG-IMN functions as a hypernetwork: a deep convolutional backbone generates the parameters of a strictly linear model specific to each input sample. This architecture enforces intrinsic interpretability, as the decision logic is mathematically transparent and the generated weights (W) serve as exact, high-resolution feature attribution maps. We introduce a transition decoder that effectively maps latent features to sample-wise weights, enabling precise localization of pathological evidence (e.g., ST-elevation, T-wave inversion) in both time and lead dimensions. We evaluate our approach on the PTB-XL dataset for classification tasks, demonstrating that the ECG-IMN achieves competitive predictive performance (AUROC comparable to black-box baselines) while providing faithful, instance-specific explanations. By explicitly decoupling parameter generation from prediction execution, our framework bridges the gap between deep learning capability and clinical trustworthiness, offering a principled path toward "white-box" cardiac diagnostics.
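To make the hypernetwork idea concrete, here is a hedged PyTorch sketch under assumed shapes and layer sizes (the backbone, the transition decoder, and `num_classes` are illustrative placeholders, not the paper's architecture): a convolutional backbone emits per-sample weights W, the prediction is the strictly linear score of W against the input, and W itself serves as the attribution map.

```python
import torch
import torch.nn as nn

class MesomorphicECG(nn.Module):
    def __init__(self, leads: int = 12, num_classes: int = 5):
        super().__init__()
        self.backbone = nn.Sequential(            # generates parameters, not predictions
            nn.Conv1d(leads, 64, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=15, padding=7), nn.ReLU(),
        )
        # "transition decoder": latent features -> sample-wise linear weights per class
        self.decoder = nn.Conv1d(64, leads * num_classes, kernel_size=1)
        self.leads, self.num_classes = leads, num_classes

    def forward(self, x: torch.Tensor):
        # x: (B, 12, T) high-resolution 12-lead ECG
        B, L, T = x.shape
        W = self.decoder(self.backbone(x)).view(B, self.num_classes, L, T)
        logits = (W * x.unsqueeze(1)).sum(dim=(2, 3))   # strictly linear in x
        return logits, W    # W doubles as an exact time-and-lead attribution map
```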
https://arxiv.org/abs/2602.09566
Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net's superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a significant 2.64% improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9% increase in IoU for small buildings, while reducing the training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.
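As a rough illustration of what a multi-scale difference block can look like, the hedged PyTorch sketch below computes absolute differences of bi-temporal feature maps at several pooled scales and fuses them. This is our reading of the block's role, not SCA-Net's released code, and the scales and fusion layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferencePyramidBlock(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1, feat_t2: (B, C, H, W) features of the two acquisition dates
        diffs, size = [], feat_t1.shape[-2:]
        for s in self.scales:
            a = F.avg_pool2d(feat_t1, s) if s > 1 else feat_t1
            b = F.avg_pool2d(feat_t2, s) if s > 1 else feat_t2
            d = (a - b).abs()                       # change evidence at scale s
            diffs.append(F.interpolate(d, size=size, mode="bilinear",
                                       align_corners=False))
        return self.fuse(torch.cat(diffs, dim=1))   # multi-scale change features
```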
https://arxiv.org/abs/2602.09529
Extracting drug-use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models have shown notable advances, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and its contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 (ToxNER) and Subtask 2 (ToxUse). Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
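A minimal sketch of an encoder-plus-CRF tagger of the kind described, assuming the public BETO checkpoint on the Hugging Face Hub and the `pytorch-crf` package; the tag-set size and checkpoint id are assumptions, not the NOWJ system's configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

class BetoCrfTagger(nn.Module):
    def __init__(self, num_tags: int,
                 encoder: str = "dccuchile/bert-base-spanish-wwm-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(encoder)
        self.emissions = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        e = self.emissions(h)                      # per-token tag scores
        mask = attention_mask.bool()
        if tags is not None:                       # training: negative log-likelihood
            return -self.crf(e, tags, mask=mask, reduction="mean")
        return self.crf.decode(e, mask=mask)       # inference: best tag sequences
```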
https://arxiv.org/abs/2602.09469
Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps existing deep learning-based approaches for detecting and classifying visual pollution and for designing a comprehensive application framework for its management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, yielding 26 articles. Most research focuses on specific pollutant categories and employs variants of the YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, and those that do tend to be geographically skewed. We propose a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution in a given area. This review highlights the need for a unified UVP management system that incorporates a pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.
https://arxiv.org/abs/2602.09446
Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by the modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a "Semantic Anchor" effect: text-based guidance regularizes performance in ultra-low-shot regimes ($k < 4$) but degrades accuracy in higher-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a cold-start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, significantly reducing the demands of initial manual labeling and making the method practical for ITS applications.
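A hedged sketch of one step of such a pipeline: an orthographic side-view projection of a LiDAR scan with depth encoded as intensity, followed by a morphological closing to densify sparse returns. The axis conventions, resolution, and closing size below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np
from scipy.ndimage import grey_closing

def depth_proxy(points: np.ndarray, res: float = 0.05,
                img_hw: tuple = (256, 512)) -> np.ndarray:
    """points: (N, 3) with x along the road, y = lateral depth, z = height (assumed)."""
    H, W = img_hw
    img = np.zeros((H, W), dtype=np.float32)
    u = np.clip((points[:, 0] / res).astype(int), 0, W - 1)          # horizontal pixel
    v = np.clip(H - 1 - (points[:, 2] / res).astype(int), 0, H - 1)  # vertical pixel
    depth = points[:, 1]
    intensity = 1.0 - (depth - depth.min()) / (np.ptp(depth) + 1e-6) # near = bright
    np.maximum.at(img, (v, u), intensity)   # keep the nearest return per pixel
    return grey_closing(img, size=(3, 3))   # fill small gaps between scan lines
```

The resulting grayscale proxy can then be handed to an off-the-shelf VLM alongside few-shot exemplars.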
https://arxiv.org/abs/2602.09425
Domain adaptation (DA) is a quickly expanding area of machine learning concerned with adjusting a model trained in one domain so that it performs well in another. While there has been notable progress, the fundamental idea behind many DA methods has persisted: aligning data from different domains in a shared feature space, where knowledge acquired from labeled source data can improve model training on target data that lacks sufficient labels. In this study, we use 10 deep learning models to simulate common DA techniques and explore their application to four medical image datasets. We consider various settings, including multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 on a brain tumor (BT) dataset improves model performance by 4.7\%. Similarly, DA can reduce the impact of Gaussian noise, providing a $\sim 3\%$ accuracy increase with ResNet34 on the BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., a $\sim 0.3\%$ performance increase) for skin cancer classification. In addition, DA can improve the interpretability of models under the Grad-CAM++ technique, which offers clinical value. Calibration analysis also demonstrates that DA yields a $\sim 2\%$ lower expected calibration error (ECE) than a CNN alone on a multi-modality dataset.
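For concreteness, one representative feature-alignment objective used in many DA methods is the Maximum Mean Discrepancy (MMD) between source and target feature batches; the RBF-kernel version below is a standard sketch, and the study surveys many techniques beyond this single loss.

```python
import torch

def mmd_rbf(src: torch.Tensor, tgt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """src: (Ns, D), tgt: (Nt, D) features from a shared backbone; smaller = better aligned."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(src, src).mean() + kernel(tgt, tgt).mean() - 2 * kernel(src, tgt).mean()

# Typical use in a training loop (lambda_da is a tuning weight, assumed):
#   total_loss = task_loss(source_logits, source_labels) \
#                + lambda_da * mmd_rbf(source_feats, target_feats)
```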
https://arxiv.org/abs/2602.09355
The rise of cyberbullying on social media platforms involving toxic comments has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine learning or deep learning algorithms. However, such statistics-based solutions are generally prone to adversarial attacks that contain logic-based modifications, such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviate the negation-attack problem and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine learning) methods over various purely statistical solutions.
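To illustrate the wrapper pattern, here is a deliberately simple sketch: detect negation cues in the text and attenuate the underlying statistical toxicity score when they are present. The cue list and the attenuation factor are invented for illustration; the paper's formal-reasoning rules are far richer than this.

```python
import re
from typing import Callable

# Hypothetical cue list; a real system would use formal logic over parse structure.
NEGATION_CUES = re.compile(r"\b(not|never|no|n't|without)\b", re.IGNORECASE)

def wrapped_toxicity(text: str, score_fn: Callable[[str], float]) -> float:
    """score_fn: any ML toxicity scorer returning a value in [0, 1]."""
    raw = score_fn(text)           # underlying statistical score
    if NEGATION_CUES.search(text):
        return raw * 0.5           # post-hoc correction for negated phrasing (assumed rule)
    return raw
```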
https://arxiv.org/abs/2602.09343
Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic planning. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a "black-box" nature, hindering their clinical integration. To mitigate these limitations, we propose GAFR-Net, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFR-Net constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue structures. In addition, a differentiable fuzzy-rule module encodes intrinsic topological descriptors, including node degree, clustering coefficient, and label consistency, into explicit, human-understandable diagnostic logic. This design establishes transparent "IF-THEN" mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
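A toy illustration of how graph descriptors can feed a differentiable fuzzy "IF-THEN" rule, using Gaussian membership functions: the rule form, centers, and widths below are invented for illustration and are not GAFR-Net's learned rules.

```python
import torch

def gaussian_membership(x: torch.Tensor, center: float, width: float) -> torch.Tensor:
    """Smooth, differentiable degree to which x matches a fuzzy concept."""
    return torch.exp(-((x - center) ** 2) / (2 * width ** 2))

def fuzzy_rule(degree: torch.Tensor, clustering: torch.Tensor) -> torch.Tensor:
    """IF degree is HIGH AND clustering is LOW THEN evidence is HIGH (hypothetical rule)."""
    degree_high = gaussian_membership(degree, center=8.0, width=2.0)
    clustering_low = gaussian_membership(clustering, center=0.1, width=0.1)
    return degree_high * clustering_low   # product t-norm implements the fuzzy AND
```

Because every operation is differentiable, such rules can be trained end-to-end while remaining readable as explicit diagnostic logic.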
https://arxiv.org/abs/2602.09318
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images: static watermark patterns generated on fixed-scale images scale poorly to dynamic, high-resolution scans, which exhibit limited visual diversity and subtle anatomical structures, and watermarks must remain effective while preserving diagnostic quality. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy and robustness against dynamic scaling processes while preserving diagnostic quality and visual indistinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, which achieves a watermark success rate (WSR) of 100% and reduces the probability of false positives in the Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
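A minimal sketch of the Laplacian-regularization idea: convolve the generated perturbation with a discrete Laplacian kernel and penalize the response energy, discouraging high-frequency watermark content that would not survive rescaling. The kernel and the squared-energy penalty are standard choices assumed here, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_penalty(perturbation: torch.Tensor) -> torch.Tensor:
    """perturbation: (B, 1, H, W) watermark added to a chest x-ray."""
    response = F.conv2d(perturbation, LAPLACIAN.to(perturbation.device), padding=1)
    return response.pow(2).mean()   # small value => smoother, more scale-robust mark

# Assumed usage: loss = efficacy_loss + stealth_loss + lambda_lap * laplacian_penalty(delta)
```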
https://arxiv.org/abs/2602.09284
Colorectal cancer (CRC) remains a significant cause of cancer-related mortality, despite the widespread implementation of prophylactic initiatives aimed at detecting and removing precancerous polyps. Although screening effectively reduces incidence, a notable portion of patients initially diagnosed with low-grade adenomatous polyps will still develop CRC later in life, even without the presence of known high-risk syndromes. Identifying which low-risk patients are at higher risk of progression is a critical unmet need for tailored surveillance and preventative therapeutic strategies. Traditional histological assessment of adenomas, while fundamental, may not fully capture subtle architectural or cytological features indicative of malignant potential. Advancements in digital pathology and machine learning provide an opportunity to analyze whole-slide images (WSIs) comprehensively and objectively. This study investigates whether machine learning algorithms, specifically convolutional neural networks (CNNs), can detect subtle histological features in WSIs of low-grade tubular adenomas that are predictive of a patient's long-term risk of developing colorectal cancer.
https://arxiv.org/abs/2602.09155
Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.
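A compact sketch of the core mechanism: a conditional denoiser trained in the FM-induced latent space with a standard DDPM noise-prediction loss. The MLP denoiser, the condition vector `cond` (standing in for the semantic auxiliary features), and the noise schedule are simplified assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    def __init__(self, latent_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, cond, t):
        # z_t: noisy latents; cond: semantic condition vectors; t: timestep indices
        return self.net(torch.cat([z_t, cond, t[:, None].float()], dim=-1))

def ddpm_step(model, z0, cond, alphas_cumprod, optimizer):
    """One training step; z0 are FM latents of real samples."""
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t][:, None]
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise       # forward diffusion
    loss = (model(z_t, cond, t) - noise).pow(2).mean() # noise-prediction objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Example schedule: alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
```

Sampling from the trained denoiser, conditioned on under-represented classes, then yields synthetic latents for augmentation.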
https://arxiv.org/abs/2602.02841
Linear recurrent neural networks (LRNNs) provide a structured approach to sequence modeling that bridges classical linear dynamical systems and modern deep learning, offering both expressive power and theoretical guarantees on stability and trainability. In recent years, multiple LRNN-based architectures have been proposed, each introducing distinct parameterizations, discretization schemes, and implementation constraints. However, existing implementations are fragmented across different software frameworks, often rely on framework-specific optimizations, and in some cases require custom CUDA kernels or lack publicly available code altogether. As a result, using, comparing, or extending LRNNs requires substantial implementation effort. To address this, we introduce $\texttt{lrnnx}$, a unified software library that implements several modern LRNN architectures under a common interface. The library exposes multiple levels of control, allowing users to work directly with core components or higher-level model abstractions. $\texttt{lrnnx}$ aims to improve accessibility, reproducibility, and extensibility of LRNN research and applications. We make our code available under a permissive MIT license.
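A minimal sketch of the common skeleton that such a library wraps: a diagonal linear recurrence $h_t = a \odot h_{t-1} + b \odot x_t$ with $|a| < 1$ for stability. Parameterization and discretization details differ across the architectures $\texttt{lrnnx}$ implements; this sequential reference scan is only the shared core, not the library's API.

```python
import torch

def diagonal_lrnn_scan(x: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """x: (T, D) inputs; a, b: (D,) per-channel parameters with |a| < 1 for stability."""
    T, D = x.shape
    h = torch.zeros(D)
    states = []
    for t in range(T):          # sequential reference scan; libraries parallelize this
        h = a * h + b * x[t]
        states.append(h)
    return torch.stack(states)  # (T, D) hidden trajectory

# A stability-friendly parameterization used by several LRNN variants:
#   a = torch.exp(-torch.exp(log_neg_log_a))   # keeps a in (0, 1)
```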
https://arxiv.org/abs/2602.08810
Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. Yet modern vision architectures have strayed from these principles: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, OV-Encoder consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, establishing OV-Encoder as a scalable engine for next-generation visual generalists.
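A hedged sketch of codec-style patch selection in this spirit: rank a frame's patches by the energy of the temporal residual against the previous frame and keep only the top fraction. The residual criterion and keep ratio are our illustrative stand-ins for Codec Patchification, not the paper's implementation.

```python
import torch

def select_salient_patches(prev_frame: torch.Tensor, frame: torch.Tensor,
                           patch: int = 16, keep_ratio: float = 0.25):
    """frames: (C, H, W); returns indices of kept patches and their flattened tokens."""
    C, H, W = frame.shape
    resid = (frame - prev_frame).unfold(1, patch, patch).unfold(2, patch, patch)
    # resid: (C, H/p, W/p, p, p) -> per-patch residual ("surprise") energy
    energy = resid.pow(2).sum(dim=(0, 3, 4)).flatten()
    k = max(1, int(keep_ratio * energy.numel()))
    keep = energy.topk(k).indices                 # high-entropy patch indices
    tokens = frame.unfold(1, patch, patch).unfold(2, patch, patch)
    tokens = tokens.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)
    return keep, tokens[keep]                     # only these tokens enter the encoder
```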
https://arxiv.org/abs/2602.08683
Accurate annotation of fixation type is a critical step in slide preparation for pathology laboratories. However, this manual process is prone to errors, impacting downstream analyses and diagnostic accuracy. Existing methods for verifying formalin-fixed, paraffin-embedded (FFPE), and frozen section (FS) fixation types typically require full-resolution whole-slide images (WSIs), limiting scalability for high-throughput quality control. We propose a deep-learning model to predict fixation types using low-resolution, pre-scan thumbnail images. The model was trained on WSIs from the TUM Institute of Pathology (n=1,200, Leica GT450DX) and evaluated on a class-balanced subset of The Cancer Genome Atlas dataset (TCGA, n=8,800, Leica AT2), as well as on class-balanced datasets from Augsburg (n=695 [392 FFPE, 303 FS], Philips UFS) and Regensburg (n=202, 3DHISTECH P1000). Our model achieves an AUROC of 0.88 on TCGA, outperforming comparable pre-scan methods by 4.8%. It also achieves AUROCs of 0.72 on Regensburg and Augsburg slides, underscoring challenges related to scanner-induced domain shifts. Furthermore, the model processes each slide in 21 ms, $400\times$ faster than existing high-magnification, full-resolution methods, enabling rapid, high-throughput processing. This approach provides an efficient solution for detecting labelling errors without relying on high-magnification scans, offering a valuable tool for quality control in high-throughput pathology workflows. Future work will improve and evaluate the model's generalisation to additional scanner types. Our findings suggest that this method can increase accuracy and efficiency in digital pathology workflows and may be extended to other low-resolution slide annotations.
https://arxiv.org/abs/2602.08652
This paper investigates the impact of hybridizing a multi-modal Genetic Algorithm with a Graph Neural Network for timetabling optimization. The Graph Neural Network is designed to encapsulate general domain knowledge to improve schedule quality, while the Genetic Algorithm explores different regions of the search space and integrates the deep learning model as an enhancement operator to guide the solution search towards optimality. Initially, both components of the hybrid technique were designed, developed, and optimized independently for the task at hand. Multiple experiments were conducted on Staff Rostering, a well-known timetabling problem, to compare the proposed hybridization with the standalone optimized versions of the Genetic Algorithm and Graph Neural Network. The experimental results demonstrate that the proposed hybridization brings statistically significant improvements in both time efficiency and solution quality, compared to the standalone methods. To the best of our knowledge, this work proposes the first hybridization of a Genetic Algorithm with a Graph Neural Network for solving timetabling problems.
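A skeletal sketch of the hybridization pattern: a standard GA loop in which a learned model acts as an extra enhancement operator on offspring. Here `gnn_enhance` is a stub standing in for the trained Graph Neural Network, and the representation, fitness, and variation operators are placeholders for a staff-rostering encoding.

```python
import random
from typing import Callable, List

def hybrid_ga(init_pop: List[list], fitness: Callable[[list], float],
              crossover: Callable, mutate: Callable,
              gnn_enhance: Callable[[list], list],
              generations: int = 100) -> list:
    pop = init_pop
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)       # elitist selection
        parents = pop[: len(pop) // 2]
        children = []
        while len(children) < len(pop) - len(parents):
            p1, p2 = random.sample(parents, 2)
            child = mutate(crossover(p1, p2))
            children.append(gnn_enhance(child))   # deep model repairs/improves the schedule
        pop = parents + children
    return max(pop, key=fitness)
```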
https://arxiv.org/abs/2602.08619
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging: conventional networks typically operate in a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of the phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing a Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20\% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, consistent with the intrinsic circular nature of the phase. The source code is available at this https URL.
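A small sketch of why the circular topology matters: a Euclidean distance on raw phase values breaks at the ±π wrap-around, while a distance defined on the unit circle does not. The metric below is a standard anti-wrapping choice, assumed rather than taken from the paper's code.

```python
import torch

def circular_phase_distance(phi_est: torch.Tensor, phi_ref: torch.Tensor) -> torch.Tensor:
    """Anti-wrapping phase error in [0, 2]: 1 - cos of the phase difference."""
    return (1.0 - torch.cos(phi_est - phi_ref)).mean()

# Example: phases of -3.1 rad and +3.1 rad are nearly identical on the circle.
# Their naive Euclidean gap is 6.2, but circular_phase_distance is only ~0.003.
```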
https://arxiv.org/abs/2602.08556
This paper presents the integration of flow field reconstruction, dynamic probabilistic modeling, search control, and machine vision detection in a system for autonomous maritime search operations. Field experiments conducted in Valun Bay (Cres Island, Croatia) involved real-time drifter data acquisition, surrogate flow model fitting based on computational fluid dynamics and numerical optimization, advanced multi-UAV search control and vision sensing, as well as deep learning-based object detection. The results demonstrate that a tightly coupled approach enables reliable detection of floating targets under realistic uncertainties and complex environmental conditions, providing concrete insights for future autonomous maritime search and rescue applications.
https://arxiv.org/abs/2602.08450