Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single-object scenarios commonly encountered in automated sorting systems. In this work, we introduce $\textbf{SortScrews}$, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at $512\times512$ resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet-B0 and ResNet-18 classifiers pretrained on ImageNet. In addition, we conduct a detailed failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at this https URL.
https://arxiv.org/abs/2603.13027
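The baselines above fine-tune ImageNet-pretrained backbones; in its simplest form this reduces to training a classifier head on frozen features. A minimal NumPy sketch of that linear-probe step, on synthetic stand-in embeddings (the seven classes mirror the dataset's six screw types plus background, but the features, dimensions, and learning rate here are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen ImageNet features: 7 classes
# (six screw types + background), 40 samples each, 64-dim embeddings.
n_classes, n_per_class, dim = 7, 40, 64
centers = rng.normal(size=(n_classes, dim))
X = np.vstack([c + 0.3 * rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Only the linear head is trained (the backbone stays frozen), so the
# transfer step is multinomial logistic regression on the embeddings.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
onehot = np.eye(n_classes)[y]
for _ in range(200):
    p = softmax(X @ W + b)
    W -= 0.5 * X.T @ (p - onehot) / len(X)
    b -= 0.5 * (p - onehot).mean(axis=0)

acc = float((np.argmax(X @ W + b, axis=1) == y).mean())
print(f"frozen-feature linear probe, training accuracy: {acc:.2f}")
```

With well-separated synthetic classes the probe fits quickly; real fine-tuning would additionally unfreeze some backbone layers.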
Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
https://arxiv.org/abs/2603.12887
Abrasive flap wheels are commonly used for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with the EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and to examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.
https://arxiv.org/abs/2603.12852
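The abstract above uses Grad-CAM to check that the models attend to physically relevant wear features. The core computation is standard: global-average-pool the class-score gradients per channel, take the weighted sum of the activation maps, and apply ReLU. A self-contained NumPy sketch on toy tensors (not the EfficientNetV2 model):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.

    activations: (C, H, W) feature maps of the target layer
    gradients:   (C, H, W) gradients of the class score w.r.t. those maps
    """
    # One weight per channel: global-average-pooled gradients.
    weights = gradients.mean(axis=(1, 2))                       # (C,)
    # Weighted sum of the maps, then ReLU: keep positive evidence only.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()                                   # scale to [0, 1]
    return cam

# Toy check: a channel with positive gradient whose activation peaks at
# one location should dominate the heatmap there.
acts = np.zeros((2, 4, 4))
acts[0, 1, 2] = 5.0          # "wear feature" fires at (1, 2)
acts[1] = 1.0                # uninformative channel
grads = np.stack([np.full((4, 4), 1.0), np.full((4, 4), -1.0)])
cam = grad_cam(acts, grads)
i, j = np.unravel_index(cam.argmax(), cam.shape)
print(int(i), int(j))   # → 1 2
```

Overlaying `cam` (upsampled to image size) on the input is what lets one verify the model looks at the flap profile rather than the background.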
Deep learning and computer vision techniques have become increasingly important in the development of self-driving cars. These techniques play a crucial role in enabling self-driving cars to perceive and understand their surroundings, allowing them to safely navigate and make decisions in real-time. Using neural networks, self-driving cars can accurately identify and classify objects such as pedestrians, other vehicles, and traffic signals. Using deep learning and analyzing data from sensors such as cameras and radar, self-driving cars can predict the likely movement of other objects and plan their own actions accordingly. This study presents a novel approach to enhancing the performance of self-driving cars, using pre-trained and custom-made neural networks for key tasks including traffic sign classification, vehicle detection, lane detection, and behavioral cloning. The methodology integrates several innovative techniques, such as geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction. These techniques are applied to diverse datasets, including the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected using the Udacity self-driving car simulator, to evaluate model efficacy. The primary objective of the work is to review the state-of-the-art in deep learning and computer vision for self-driving cars. The findings of the work are effective in solving various challenges related to self-driving cars, such as traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, and provide valuable insights into improving the robustness and reliability of autonomous systems, paving the way for future research and deployment of safer and more efficient self-driving technologies.
https://arxiv.org/abs/2603.09255
Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at this https URL.
https://arxiv.org/abs/2603.08844
We develop a novel transfer learning framework to tackle the challenge of limited training data in image reconstruction problems. The proposed framework consists of two training steps, both of which are formed as bi-level optimizations. In the first step, we train a powerful universal feature-extractor that is capable of learning important knowledge from large, heterogeneous data sets in various domains. In the second step, we train a task-specific domain-adapter for a new target domain or task with only a limited amount of data available for training. The composition of the adapter and the universal feature-extractor then effectively extracts features that serve as an important component of image regularization for the new domain, leading to high-quality reconstruction despite the limited data. We apply this framework to reconstruct under-sampled MR images with limited data by using a collection of diverse data samples from different domains, such as images of other anatomies, measurements of various sampling ratios, and even different image modalities, including natural images. Experimental results demonstrate a promising transfer learning capability of the proposed method.
https://arxiv.org/abs/2603.07831
Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation could effectively transfer depth information from LiDAR to an image-based network. However, the modality gap between image and LiDAR severely limits its accuracy. In this paper, we systematically investigate the negative transfer problem induced by the modality gap in cross-modality distillation for the first time, including not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviating negative transfer to the image-based network. On the one hand, we utilize similar architectures to ensure spatial alignment of features between the image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillations, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models, where we take three recent models on KITTI and a recent model on NuScenes for validation. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models.
https://arxiv.org/abs/2603.07464
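The key idea of the selective distillation modules, as described, is to weight the distillation loss by depth confidence so the image branch imitates LiDAR features only where depth is reliable. A toy NumPy sketch of an uncertainty-weighted feature-distillation loss (the exponential confidence map and per-location MSE are assumed choices for illustration; MonoSTL's exact weighting may differ):

```python
import numpy as np

def selective_feature_distillation(f_img, f_lidar, depth_sigma):
    """Uncertainty-weighted feature distillation (illustrative sketch).

    f_img, f_lidar: (C, H, W) spatially aligned student/teacher features
    depth_sigma:    (H, W) predicted depth uncertainty (std. dev.)
    High-uncertainty locations get low weight, so only trustworthy
    LiDAR cues are imitated (positive transfer), not noisy ones.
    """
    w = np.exp(-depth_sigma)                            # confidence in (0, 1]
    per_loc = ((f_img - f_lidar) ** 2).mean(axis=0)     # (H, W) feature MSE
    return float((w * per_loc).sum() / w.sum())

rng = np.random.default_rng(1)
f_img = rng.normal(size=(8, 4, 4))
f_lidar = f_img.copy()
f_lidar[:, :2, :] += 1.0     # teacher cues corrupted where depth is unreliable
f_lidar[:, 2:, :] += 0.05    # trustworthy elsewhere
sigma = np.zeros((4, 4))
sigma[:2] = 5.0              # high depth uncertainty in the top rows

loss_sel = selective_feature_distillation(f_img, f_lidar, sigma)
loss_uni = selective_feature_distillation(f_img, f_lidar, np.zeros((4, 4)))
print(loss_sel < loss_uni)   # → True
```

Down-weighting the corrupted region shrinks the loss contribution that would otherwise drag the student toward unreliable teacher features.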
Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in the context of cross-sample and cross-platform/batch settings for accomplishing CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms eight existing state-of-the-art methods in CTR detection.
https://arxiv.org/abs/2603.06186
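The fusion core described above is a bidirectional cross-attention between histology and gene-expression features. A minimal NumPy sketch of plain single-head bidirectional cross-attention with random projections (the category regularization and variational reconstruction guidance of SpaCRD are omitted; all dimensions and the concatenation-based fusion are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats, Wq, Wk, Wv):
    """One direction of cross-attention: queries from one modality,
    keys/values from the other."""
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))   # (n_q, n_kv) attention
    return A @ V

rng = np.random.default_rng(0)
d, n_spots = 16, 10
hist = rng.normal(size=(n_spots, d))    # histology patch embedding per spot
genes = rng.normal(size=(n_spots, d))   # gene-expression embedding per spot
W = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(6)]

# Bidirectional: image→expression and expression→image, then fuse.
h2g = cross_attend(hist, genes, *W[:3])
g2h = cross_attend(genes, hist, *W[3:])
fused = np.concatenate([h2g, g2h], axis=1)
print(fused.shape)   # → (10, 32)
```

Each spot's fused vector mixes what its histology looks like with what its neighborhood expresses, which is the co-expression coupling the abstract describes.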
Generalizing skill policies to novel conditions remains a key challenge in robot learning. Imitation learning methods, while data-efficient, are largely confined to the training region and consistently fail on input data outside it, leading to unpredictable policy failures. Alternatively, transfer learning approaches offer methods for trajectory generation robust to both changes in environment or tasks, but they remain data-hungry and lack accuracy in zero-shot generalization. We address these challenges by framing the problem in the context of task inversion learning and proposing a novel joint learning approach to achieve accurate and efficient knowledge transfer. Our method constructs a common representation of the forward and inverse tasks, and leverages auxiliary forward demonstrations from novel configurations to successfully execute the corresponding inverse tasks, without any direct supervision. We show the extrapolation capabilities of our framework via ablation studies and experiments in simulated and real-world environments that require complex manipulation skills with a diverse set of objects and tools, where we outperform diffusion-based alternatives.
https://arxiv.org/abs/2603.05576
Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics -- such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning -- that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.
https://arxiv.org/abs/2603.02465
Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventional approaches often take a "bottom-up" mechanistic route by reverse-engineering interpretable features or circuits to build explanatory models. While insightful, these methods often struggle to provide the high-level, predictive signals needed to anticipate failure in real-world deployment. Here, we propose a "top-down" approach to studying generalization failures inspired by medical biomarkers: identifying system-level measurements that serve as robust indicators of a model's future performance. Rather than mapping out detailed internal mechanisms, we systematically design and test network markers to probe structure-function links, identify prognostic indicators, and validate predictions in real-world settings. In image classification, we find that task-relevant geometric properties of in-distribution (ID) object manifolds consistently forecast poor out-of-distribution (OOD) generalization. In particular, reductions in two geometric measures, effective manifold dimensionality and utility, predict weaker OOD performance across diverse architectures, optimizers, and datasets. We apply this finding to transfer learning with ImageNet-pretrained models. We consistently find that the same geometric patterns predict OOD transfer performance more reliably than ID accuracy. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.
https://arxiv.org/abs/2603.01879
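One common formalization of "effective manifold dimensionality" is the participation ratio of the covariance eigenvalue spectrum; whether this matches the paper's exact measure is an assumption, but it illustrates the kind of system-level geometric marker described:

```python
import numpy as np

def effective_dimensionality(X):
    """Participation ratio of the feature covariance spectrum:
    (sum λ_i)^2 / sum λ_i^2.  A standard proxy for the 'effective
    dimensionality' of a representation (assumed here; the paper's
    precise definition may differ).  X: (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)
    lam = np.linalg.eigvalsh(np.cov(Xc.T))
    lam = np.clip(lam, 0.0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

rng = np.random.default_rng(0)
# Isotropic 20-d feature cloud: variance spread over all directions.
iso = rng.normal(size=(500, 20))
# The same cloud squashed onto ~2 directions: a collapsed representation.
squash = iso * np.r_[np.ones(2), np.full(18, 0.05)]
print(round(effective_dimensionality(iso), 1),
      round(effective_dimensionality(squash), 1))
```

The marker is cheap to compute from ID features alone, which is what makes it usable as a "biomarker" for forecasting OOD behavior before deployment.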
Speech is a natural means of conveying emotions, making it an effective method for understanding and representing human feelings. Reliable speech emotion recognition (SER) is central to applications in human-computer interaction, healthcare, education, and customer service. However, most SER methods depend on heavy backbone models or hand-crafted features that fail to balance accuracy and efficiency, particularly for low-resource languages like Bangla. In this work, we present SpectroFusion-ViT, a lightweight SER framework built utilizing EfficientViT-b0, a compact Vision Transformer architecture equipped with self-attention to capture long-range temporal and spectral patterns. The model contains only 2.04M parameters and requires 0.1 GFLOPs, enabling deployment in resource-constrained settings without compromising accuracy. Our pipeline first performs preprocessing and augmentation on raw audio, then extracts Chroma and Mel-frequency cepstral coefficient (MFCC) features. These representations are fused into a complementary time-frequency descriptor that preserves both fine-grained spectral detail and broader harmonic structure. Using transfer learning, EfficientViT-b0 is fine-tuned for multi-class emotion classification. We evaluate the system on two benchmark Bangla emotional speech datasets, SUBESCO and BanglaSER, which vary in speaker diversity, recording conditions, and acoustic characteristics. The proposed approach achieves 92.56% accuracy on SUBESCO and 82.19% on BanglaSER, surpassing existing state-of-the-art methods. These findings demonstrate that lightweight transformer architectures can deliver robust SER performance while remaining computationally efficient for real-world deployment.
https://arxiv.org/abs/2603.00746
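The fusion step above stacks Chroma and MFCC features into one time-frequency descriptor. A minimal NumPy sketch with random stand-in features (real code would extract them with a library such as librosa; the per-feature z-normalization before stacking is an assumed choice, not necessarily the paper's):

```python
import numpy as np

# Hypothetical stand-ins for features of one utterance: 40 MFCC
# coefficients and 12 chroma bins over T frames.
T = 128
mfcc = np.random.default_rng(0).normal(size=(40, T))
chroma = np.random.default_rng(1).random((12, T))

def fuse(mfcc, chroma):
    """Z-normalize each feature block, then stack them into a single
    time-frequency 'image' a vision backbone can consume: MFCC rows
    carry fine spectral detail, chroma rows the harmonic structure."""
    def norm(x):
        return (x - x.mean()) / (x.std() + 1e-8)
    return np.vstack([norm(mfcc), norm(chroma)])

descriptor = fuse(mfcc, chroma)
print(descriptor.shape)   # → (52, 128)
```

The fused matrix is then treated like an image, which is what lets an EfficientViT-style backbone pretrained on images be fine-tuned for emotion classes.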
Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While printed document text recognition has advanced because of available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training a separate model for each modality precludes cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models results in significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across different modalities often leads to degraded performance on underrepresented modalities. To address these, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features according to a particular input image modality and enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models are accessible via this gated repository\footnote{in review}.
https://arxiv.org/abs/2603.00702
Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks, such as comprehensive head CT finding detection, remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures; general-domain, medical-domain, and head CT-specific pretraining; and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings, including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes, our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2% to +1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917. These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.
https://arxiv.org/abs/2603.00675
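The MoLRE forward pass can be pictured as a frozen base projection plus a softly routed sum of LoRA-style low-rank updates. An illustrative NumPy sketch (the toy dimensions, linear softmax router, and scaling are assumptions for clarity, not the paper's implementation):

```python
import numpy as np

def molre_forward(x, W0, experts, Wg):
    """Mixture of Low-Rank Experts, one linear layer (sketch).

    x:       (d_in,) input features
    W0:      (d_out, d_in) frozen pretrained weight
    experts: list of (A_k, B_k) with A_k (r, d_in), B_k (d_out, r)
    Wg:      (n_experts, d_in) unsupervised soft-router weights
    """
    gate = np.exp(Wg @ x)
    gate = gate / gate.sum()                 # soft routing over experts
    # Conditional adaptation: gated sum of LoRA updates B_k A_k x.
    delta = sum(g * (B @ (A @ x)) for g, (A, B) in zip(gate, experts))
    return W0 @ x + delta, gate

rng = np.random.default_rng(0)
d_in, d_out, r, n_exp = 32, 16, 4, 4
W0 = rng.normal(scale=0.1, size=(d_out, d_in))
experts = [(rng.normal(scale=0.1, size=(r, d_in)),
            rng.normal(scale=0.1, size=(d_out, r))) for _ in range(n_exp)]
Wg = rng.normal(scale=0.1, size=(n_exp, d_in))

y, gate = molre_forward(rng.normal(size=d_in), W0, experts, Wg)
print(y.shape, round(float(gate.sum()), 6))
```

Because only the `A_k`, `B_k`, and router weights are trainable, the added parameter count stays tiny relative to the frozen backbone at realistic model sizes, while the gating lets different findings use different low-rank adaptations.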
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation methods; however, a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage is still lacking. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic representations without suffering from overfitting. Additionally, we quantify the performance collapse of multimodal models in specialized domains as a result of linguistic rarity using adjusted Zipf frequency scores. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research. We make the FEWTRANS benchmark publicly available at this https URL.
https://arxiv.org/abs/2603.00478
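The "adjusted Zipf frequency scores" used above to quantify linguistic rarity are not specified in the abstract; the sketch below shows the standard Zipf scale (log10 of word frequency per billion tokens) averaged over a domain's vocabulary, with made-up toy frequencies, as one plausible reading of the idea (the adjustment itself is an assumption):

```python
import math

# Toy corpus frequencies (occurrences per million tokens) for a few
# terms; real work would use a large reference corpus, e.g. via the
# wordfreq package.  These numbers are illustrative, not measured.
freq_per_million = {"cat": 86.0, "lesion": 2.1, "angiography": 0.12}

def zipf(word):
    # Zipf scale: log10 of frequency per *billion* words,
    # so "per million" values are scaled by 1e3 first.
    return math.log10(freq_per_million[word] * 1e3)

def domain_rarity(terms):
    """Mean Zipf score of a domain's vocabulary: lower = rarer,
    which is where multimodal models are predicted to collapse."""
    return sum(zipf(t) for t in terms) / len(terms)

print(round(domain_rarity(["cat"]), 2),
      round(domain_rarity(["lesion", "angiography"]), 2))
```

A specialized domain's vocabulary (here the medical terms) scores well below everyday words, giving a scalar predictor of how far a multimodal model's pretraining distribution is from the target domain.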
Observational learning requires an agent to learn to perform a task by referencing only observations of the performed task. This work investigates the equivalent setting in real-world robot learning where access to hand-designed rewards and demonstrator actions are not assumed. To address this data-constrained setting, this work presents a planning-based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone. Experiments conducted entirely in the real world demonstrate that this paradigm is effective for learning image-based manipulation tasks from scratch in under an hour, without assuming prior knowledge, pre-training, or data of any kind beyond task observations. Moreover, this work demonstrates that the learned world model representation is capable of online transfer learning in the real world from scratch. In comparison to existing approaches, including IRL, RL, and Behavior Cloning (BC), which have more restrictive assumptions, the proposed approach demonstrates significantly greater sample efficiency and success rates, enabling a practical path forward for online world modeling and planning from observation and interaction. Videos and more at: this https URL.
https://arxiv.org/abs/2602.24121
Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
https://arxiv.org/abs/2602.23069
The identification and property prediction of chemical molecules are of central importance to drug discovery and materials science, where tandem mass spectrometry provides valuable fragmentation cues in the form of mass-to-charge-ratio peaks. However, the scarcity of experimental spectra hinders the identification of many molecules and motivates computational prediction approaches. Deep learning models appear promising for predicting mass spectra from molecular structures, but overall assessment remains challenging owing to the heterogeneity of methods and the lack of well-defined benchmarks. To address this, we introduce FlexMS, a benchmark framework for constructing and evaluating diverse model architectures for mass spectrum prediction. FlexMS supports the dynamic construction of numerous distinct combinations of model architectures while assessing their performance on preprocessed public datasets under different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters such as the learning rate, data sparsity, pretraining effects, metadata ablation settings, and cross-domain transfer learning, offering practical guidance for choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios by scoring potential matches against predicted spectra.
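The retrieval scenario described above can be sketched in a few lines: represent each spectrum as a sparse m/z-to-intensity map, score library candidates against a query by cosine similarity (a common spectrum-matching metric, though FlexMS's exact metrics are not specified in this abstract), and rank. The spectra and compound names below are toy values for illustration only.

```python
import math

def spectrum_cosine(pred, ref):
    # Cosine similarity over the union of m/z bins.
    # Spectra are sparse dicts: {m/z bin: intensity}.
    bins = set(pred) | set(ref)
    dot = sum(pred.get(b, 0.0) * ref.get(b, 0.0) for b in bins)
    norm_p = math.sqrt(sum(v * v for v in pred.values()))
    norm_r = math.sqrt(sum(v * v for v in ref.values()))
    return dot / (norm_p * norm_r) if norm_p and norm_r else 0.0

def retrieve(query_spectrum, candidate_spectra):
    # Rank candidate molecules by similarity of their predicted
    # spectra to the measured query spectrum.
    return sorted(candidate_spectra,
                  key=lambda name: spectrum_cosine(query_spectrum,
                                                   candidate_spectra[name]),
                  reverse=True)

# toy measured spectrum and a tiny library of predicted spectra
query = {77.0: 0.4, 105.0: 1.0, 122.0: 0.6}
library = {
    "candidate_a": {77.0: 0.5, 105.0: 1.0, 122.0: 0.5},  # shares all peaks
    "candidate_b": {50.0: 1.0, 63.0: 0.3},               # no peaks in common
}
ranking = retrieve(query, library)
```

Here `candidate_a` ranks first because it shares every fragment peak with the query, which is the behavior a retrieval benchmark rewards.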
https://arxiv.org/abs/2602.22822
In this study, image processing and deep learning methodologies were employed to automatically classify local olive species cultivated in Türkiye. A stereo camera was used to capture images of five distinct olive species, which were then preprocessed to ensure their suitability for analysis. Convolutional Neural Network (CNN) architectures, specifically MobileNetV2 and EfficientNetB0, were employed for image classification and optimized through a transfer learning approach. The training and testing results indicated that the EfficientNetB0 model achieved the best performance, with an accuracy of 94.5%. The findings demonstrate that deep learning-based systems offer an effective solution for classifying olive species with high accuracy. The developed method has significant potential for application in areas such as automatic identification and quality control of agricultural products.
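The core of the transfer-learning setup used here, and in several of the other abstracts in this digest, is to freeze a pretrained backbone and fit only a new classification head on its output features. As a dependency-free sketch (the paper's actual pipeline uses CNN backbones on images; the 2-D "features" and labels below are hypothetical), this amounts to training a linear softmax head with SGD:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_head(feats, labels, n_classes, epochs=200, lr=0.5):
    # Transfer-learning sketch: the pretrained backbone is frozen, so only
    # a linear softmax head over its output features is fitted with SGD.
    dim = len(feats[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = softmax([sum(wi * xi for wi, xi in zip(W[c], x)) + b[c]
                         for c in range(n_classes)])
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                b[c] -= lr * g
                for i in range(dim):
                    W[c][i] -= lr * g * x[i]
    return W, b

def predict(W, b, x):
    scores = [sum(wi * xi for wi, xi in zip(W[c], x)) + b[c]
              for c in range(len(W))]
    return max(range(len(scores)), key=scores.__getitem__)

# toy "backbone features" for two classes
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W, b = train_head(feats, labels, n_classes=2)
```

In practice one would load a pretrained MobileNetV2 or EfficientNetB0, replace its final layer, and fine-tune; the frozen-backbone-plus-new-head idea is the same.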
https://arxiv.org/abs/2603.00168
Distributed denial-of-service (DDoS) attacks threaten the availability of Internet of Things (IoT) infrastructures, particularly under resource-constrained deployment conditions. Although transfer learning models have shown promising detection accuracy, their reliability, computational feasibility, and interpretability in operational environments remain insufficiently explored. This study presents an explainability-aware empirical evaluation of seven pre-trained convolutional neural network architectures for multi-class IoT DDoS detection using the CICDDoS2019 dataset and an image-based traffic representation. The analysis integrates performance metrics, reliability-oriented statistics (MCC, Youden Index, confidence intervals), latency and training cost assessment, and interpretability evaluation using Grad-CAM and SHAP. Results indicate that DenseNet- and MobileNet-based architectures achieve strong detection performance while demonstrating superior reliability and compact, class-consistent attribution patterns. DenseNet169 offers the strongest reliability and interpretability alignment, whereas MobileNetV3 provides an effective latency-accuracy trade-off for fog-level deployment. The findings emphasize the importance of combining performance, reliability, and explainability criteria when selecting deep learning models for IoT DDoS detection.
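The reliability-oriented statistics named above have simple closed forms from a confusion matrix. A minimal sketch for the binary (one-vs-rest) case, with hypothetical confusion counts:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient: +1 perfect, 0 chance-level, -1 inverse.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def youden_index(tp, tn, fp, fn):
    # Youden's J = sensitivity + specificity - 1; 1 is perfect, 0 is chance.
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

# hypothetical counts for one attack class of a detector
j_example = youden_index(tp=90, tn=85, fp=15, fn=10)
```

Both statistics account for class imbalance, which is why they are preferred over plain accuracy when ranking detectors on skewed DDoS traffic.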
https://arxiv.org/abs/2602.22488