Modern configurable software systems need to learn models that correlate configuration and performance. However, when the system operates in dynamic environments, workload variations, hardware changes, and system updates inevitably introduce concept drifts at different levels: global drifts, which reshape the performance landscape of the entire configuration space, and local drifts, which affect only certain sub-regions of that space. As a result, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real time, making configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both local and global drifts using dually hierarchical adaptation. At the upper level, we redivide the data into different divisions and retrain the local model within each, handling global drifts only when necessary. At the lower level, the local models of the divisions detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Evaluating on eight software systems against state-of-the-art approaches, we show that DHDA achieves considerably better accuracy and adapts to drifts effectively, with up to 2x improvements, while incurring reasonable overhead; it can also improve different local models in handling concept drift.
https://arxiv.org/abs/2507.08730
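The lower-level behavior described in the DHDA abstract above (a local model that updates incrementally and refits only when it detects drift) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `LocalModel`, the running-mean predictor, the window size, and the fixed error threshold are all assumptions chosen for brevity, and the upper-level redivision step is not shown.

```python
from collections import deque


class LocalModel:
    """Toy local model (hypothetical): predicts the mean of its training data."""

    def __init__(self, window=3, threshold=5.0):
        self.window = window        # number of recent samples kept for drift checks
        self.threshold = threshold  # assumed mean-error level that signals drift
        self.total = 0.0
        self.count = 0
        self.recent = deque(maxlen=window)  # (value, absolute error) pairs
        self.retrain_count = 0

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def observe(self, y):
        """Process one measurement; return True if a drift triggered a refit."""
        if self.count == 0:
            self.total, self.count = y, 1
            return False
        self.recent.append((y, abs(y - self.predict())))
        mean_error = sum(e for _, e in self.recent) / len(self.recent)
        if len(self.recent) == self.window and mean_error > self.threshold:
            # Drift detected: discard old data and refit on the recent window.
            values = [v for v, _ in self.recent]
            self.total, self.count = sum(values), len(values)
            self.recent.clear()
            self.retrain_count += 1
            return True
        # No drift: cheap incremental update.
        self.total += y
        self.count += 1
        return False


model = LocalModel(window=3, threshold=5.0)
for y in [10.0] * 6 + [50.0] * 6:  # performance jumps from 10 to 50
    model.observe(y)
```

On a stable stream the model only performs incremental updates; the refit path runs only after the jump, which mirrors the idea of avoiding redundant computation when no drift is detected.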
In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Our method uses PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, and further enhances RGB modality performance through transfer learning with pre-training on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, which achieves a top-1 accuracy of 73.213%.
https://arxiv.org/abs/2507.08344
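The modality-weighted ensemble mentioned in the MM-Gesture abstract above can be illustrated with a weighted average of per-modality class probabilities. This is only a sketch under assumptions: the function name, the three modalities shown, the probabilities, and the weights are all made up, and the abstract does not say how the actual weights are chosen.

```python
def weighted_ensemble(prob_by_modality, weights):
    """Fuse per-modality class probabilities with a normalized weighted average."""
    total_weight = sum(weights[name] for name in prob_by_modality)
    num_classes = len(next(iter(prob_by_modality.values())))
    fused = [0.0] * num_classes
    for name, probs in prob_by_modality.items():
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / total_weight
    return fused


# Hypothetical softmax outputs for one clip over three gesture classes.
probs = {
    "joint": [0.6, 0.3, 0.1],
    "rgb": [0.2, 0.7, 0.1],
    "depth": [0.3, 0.3, 0.4],
}
weights = {"joint": 1.0, "rgb": 2.0, "depth": 0.5}  # assumed modality weights

fused = weighted_ensemble(probs, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights are normalized, the fused vector remains a probability distribution, and a strong modality (here RGB) can outvote weaker ones.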
Accurate identification of fungi species presents a unique challenge in computer vision due to fine-grained inter-species variation and high intra-species variation. This paper presents our approach for the FungiCLEF 2025 competition, which focuses on few-shot fine-grained visual categorization (FGVC) using the FungiTastic Few-Shot dataset. Our team (DS@GT) experimented with multiple vision transformer models, data augmentation, weighted sampling, and incorporating textual information. We also explored generative AI models for zero-shot classification using structured prompting but found them to significantly underperform relative to vision-based models. Our final model outperformed both competition baselines and highlighted the effectiveness of domain-specific pretraining and balanced sampling strategies. Our approach ranked 35/74 on the private test set in post-completion evaluation, which suggests that additional work can be done on metadata selection and domain-adapted multi-modal learning. Our code is available at this https URL.
https://arxiv.org/abs/2507.08248
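The weighted (balanced) sampling mentioned in the FungiCLEF abstract above is commonly done by giving each training sample a weight inversely proportional to its class frequency. The sketch below assumes that scheme; the abstract does not spell out the exact weighting, and the species labels and counts are invented for illustration.

```python
import random
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by 1 / (size of its class) so classes are drawn equally."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical long-tailed label list: 8 samples of one species, 2 of another.
labels = ["amanita"] * 8 + ["boletus"] * 2
weights = balanced_sample_weights(labels)

rng = random.Random(0)  # fixed seed so the draw is reproducible
batch = rng.choices(labels, weights=weights, k=1000)
rare_share = batch.count("boletus") / len(batch)
```

With these weights each class carries the same total sampling mass, so the rare species appears in roughly half of the drawn samples instead of one fifth.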
Visually impaired people face significant challenges in their day-to-day commutes in the urban cities of Bangladesh due to the vast number of obstructions on every path. With many injuries taking place through road accidents on a daily basis, it is paramount for a system to be developed that can alert the visually impaired of objects at close distance beforehand. To overcome this issue, a novel alert system is proposed in this research to assist the visually impaired in commuting through these busy streets without colliding with any objects. The proposed system can alert the individual to objects that are present at a close distance. It utilizes transfer learning to train models for depth estimation and object detection, and combines both models to introduce a novel system. The models are optimized through the utilization of quantization techniques to make them lightweight and efficient, allowing them to be easily deployed on embedded systems. The proposed solution achieved a lightweight real-time depth estimation and object detection model with an mAP50 of 0.801.
https://arxiv.org/abs/2507.08165
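One plausible way to combine the depth estimation and object detection outputs described in the abstract above is to read the depth values inside each detected bounding box and alert when the object is closer than a threshold. The sketch below assumes a depth map in meters, boxes given as `(x0, y0, x1, y1)` with exclusive upper bounds, and a 1.5 m alert distance; none of these details come from the paper.

```python
from statistics import median


def close_objects(detections, depth_map, max_distance_m=1.5):
    """Return labels of detected objects whose median depth is within range."""
    alerts = []
    for label, (x0, y0, x1, y1) in detections:
        depths = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        if depths and median(depths) <= max_distance_m:
            alerts.append(label)
    return alerts


# Toy 4x4 depth map: left half is 1 m away, right half is 5 m away.
depth_map = [[1.0, 1.0, 5.0, 5.0] for _ in range(4)]
detections = [("rickshaw", (0, 0, 2, 4)), ("pole", (2, 0, 4, 4))]  # hypothetical boxes

alerts = close_objects(detections, depth_map)
```

Using the median rather than the minimum makes the check less sensitive to a few noisy depth pixels inside a box.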
Accurate estimation of building heights using very high resolution (VHR) synthetic aperture radar (SAR) imagery is crucial for various urban applications. This paper introduces a Deep Learning (DL)-based methodology for automated building height estimation from single VHR COSMO-SkyMed images: an object-based regression approach based on bounding box detection followed by height estimation. This model was trained and evaluated on a unique multi-continental dataset comprising eight geographically diverse cities across Europe, North and South America, and Asia, employing a cross-validation strategy to explicitly assess out-of-distribution (OOD) generalization. The results demonstrate highly promising performance, particularly on European cities where the model achieves a Mean Absolute Error (MAE) of approximately one building story (2.20 m in Munich), significantly outperforming recent state-of-the-art methods in similar OOD scenarios. Despite the increased variability observed when generalizing to cities in other continents, particularly in Asia with its distinct urban typologies and prevalence of high-rise structures, this study underscores the significant potential of DL for robust cross-city and cross-continental transfer learning in building height estimation from single VHR SAR data.
https://arxiv.org/abs/2507.08096
In recent years, deep learning has shown great promise in the automated detection and classification of brain tumors from MRI images. However, achieving high accuracy and computational efficiency remains a challenge. In this research, we propose Deep Brain Net, a novel deep learning system designed to optimize performance in the detection of brain tumors. The model integrates the strengths of two advanced neural network architectures, EfficientNetB0 and ResNet50, combined with transfer learning to improve generalization and reduce training time. The EfficientNetB0 architecture enhances model efficiency by utilizing mobile inverted bottleneck blocks, which incorporate depthwise separable convolutions. This design significantly reduces the number of parameters and computational cost while preserving the model's ability to learn complex feature representations. The ResNet50 architecture, pre-trained on large-scale datasets like ImageNet, is fine-tuned for brain tumor classification. Its use of residual connections allows for training deeper networks by mitigating the vanishing gradient problem and avoiding performance degradation. The integration of these components ensures that the proposed system is both computationally efficient and highly accurate. Extensive experiments performed on publicly available MRI datasets demonstrate that Deep Brain Net consistently outperforms existing state-of-the-art methods in terms of classification accuracy, precision, recall, and computational efficiency. The model reaches an accuracy of 88 percent, a weighted F1 score of 88.75 percent, and a macro AUC-ROC score of 98.17 percent, which demonstrates the robustness and clinical potential of Deep Brain Net in assisting radiologists with brain tumor diagnosis.
https://arxiv.org/abs/2507.07011
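The parameter savings that the Deep Brain Net abstract above attributes to depthwise separable convolutions can be checked with a short calculation. The layer sizes below (3x3 kernel, 64 input channels, 128 output channels) are arbitrary examples rather than values from the paper, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(3, 64, 128)        # 9 * 64 * 128 = 73728
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8192 = 8768
reduction = std / sep                         # roughly 8.4x fewer weights
```

For this example layer the separable form needs about one eighth of the weights, which is the kind of saving the abstract refers to.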
Hyperspectral image (HSI) classification presents inherent challenges due to high spectral dimensionality, significant domain shifts, and limited availability of labeled data. To address these issues, we propose a novel Active Transfer Learning (ATL) framework built upon a Spatial-Spectral Transformer (SST) backbone. The framework integrates multistage transfer learning with an uncertainty-diversity-driven active learning mechanism that strategically selects highly informative and diverse samples for annotation, thereby significantly reducing labeling costs and mitigating sample redundancy. A dynamic layer freezing strategy is introduced to enhance transferability and computational efficiency, enabling selective adaptation of model layers based on domain shift characteristics. Furthermore, we incorporate a self-calibrated attention mechanism that dynamically refines spatial and spectral weights during adaptation, guided by uncertainty-aware feedback. A diversity-promoting sampling strategy ensures broad spectral coverage among selected samples, preventing overfitting to specific classes. Extensive experiments on benchmark cross-domain HSI datasets demonstrate that the proposed SST-ATL framework achieves superior classification performance compared to conventional approaches. The source code is publicly available at this https URL.
https://arxiv.org/abs/2411.18115
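The uncertainty-diversity-driven selection described in the abstract above can be sketched as a greedy loop that scores each unlabeled sample by its predictive entropy plus its distance to the samples already chosen. This is an illustrative assumption, not the SST-ATL procedure: the scoring formula, the `trade_off` parameter, and the toy data are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_samples(probs, features, budget, trade_off=1.0):
    """Greedily pick uncertain samples that lie far from those already picked."""
    chosen = []
    candidates = list(range(len(probs)))
    while candidates and len(chosen) < budget:
        def score(i):
            # Diversity: squared distance to the nearest already-chosen sample.
            diversity = min(
                (sq_dist(features[i], features[j]) for j in chosen), default=0.0
            )
            return entropy(probs[i]) + trade_off * diversity

        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


# Toy pool: samples 0 and 1 are both uncertain but nearly identical.
probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
features = [[0.0, 0.0], [0.0, 0.1], [3.0, 3.0], [1.0, 1.0]]
chosen = select_samples(probs, features, budget=2)
```

In the toy pool the near-duplicate sample 1 is skipped in favor of a distant one, which shows how a diversity term reduces redundant annotations.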
This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.
https://arxiv.org/abs/2507.06085
Recent advances in song identification leverage deep neural networks to learn compact audio fingerprints directly from raw waveforms. While these methods perform well under controlled conditions, their accuracy drops significantly in real-world scenarios where the audio is captured via mobile devices in noisy environments. In this paper, we introduce a novel evaluation protocol designed to better reflect such real-world conditions. We generate three recordings of the same audio, each with increasing levels of noise, captured using a mobile device's microphone. Our results reveal a substantial performance drop for two state-of-the-art CNN-based models under this protocol, compared to previously reported benchmarks. Additionally, we highlight the critical role of the augmentation pipeline during training with contrastive loss. By introducing low-pass and high-pass filters in the augmentation pipeline, we significantly increase the performance of both systems in our proposed evaluation. Furthermore, we develop a transformer-based model with a tailored projection module and demonstrate that transferring knowledge from a semantically relevant domain yields a more robust solution. The transformer architecture outperforms CNN-based models across all noise levels and query durations. In low-noise conditions, it finds the correct song for 47.99% of 1-second queries and 97% of 10-second queries, surpassing the second-best performing model by 14% and 18.5%, respectively. Under heavy noise, we achieve a detection rate of 56.5% for 15-second queries. All experiments are conducted on a public large-scale dataset of over 100K songs, with queries matched against a database of 56 million vectors.
https://arxiv.org/abs/2507.06070
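The low-pass and high-pass filter augmentations mentioned in the abstract above can be illustrated with the simplest possible filters. The sketch uses one-pole filters with a hand-picked smoothing factor `alpha`; the paper's actual filter design and cutoff frequencies are not given in the abstract, so treat every detail here as an assumption.

```python
def low_pass(signal, alpha):
    """One-pole low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out, prev = [], 0.0
    for x in signal:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def high_pass(signal, alpha):
    """Complementary high-pass: the input minus its low-passed version."""
    return [x - y for x, y in zip(signal, low_pass(signal, alpha))]


constant = [1.0] * 200                                              # lowest frequency
alternating = [1.0 if n % 2 == 0 else -1.0 for n in range(200)]     # highest frequency

slow_kept = low_pass(constant, 0.2)[-1]         # approaches 1.0
fast_removed = low_pass(alternating, 0.2)[-1]   # strongly attenuated
fast_kept = high_pass(alternating, 0.2)[-1]     # mostly preserved
```

Applying such filters to training clips mimics the band-limited response of a phone microphone, which is one way an augmentation pipeline could narrow the gap between clean and recorded audio.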
Soft tissue simulation in virtual environments is becoming increasingly important for medical applications. However, the high deformability of soft tissue poses significant challenges. Existing methods rely on segmentation, meshing and estimation of stiffness properties of tissues. In addition, the integration of haptic feedback requires precise force estimation to enable a more immersive experience. We introduce a novel data-driven model, a conditional graph neural network (cGNN) to tackle this complexity. Our model takes surface points and the location of applied forces, and is specifically designed to predict the deformation of the points and the forces exerted on them. We trained our model on experimentally collected surface tracking data of a soft tissue phantom and used transfer learning to overcome the data scarcity by initially training it with mass-spring simulations and fine-tuning it with the experimental data. This approach improves the generalisation capability of the model and enables accurate predictions of tissue deformations and corresponding interaction forces. The results demonstrate that the model can predict deformations with a distance error of 0.35$\pm$0.03 mm for deformations up to 30 mm and the force with an absolute error of 0.37$\pm$0.05 N for forces up to 7.5 N. Our data-driven approach presents a promising solution to the intricate challenge of simulating soft tissues within virtual environments. Beyond its applicability in medical simulations, this approach holds the potential to benefit various fields where realistic soft tissue simulations are required.
https://arxiv.org/abs/2507.05315
Colorectal cancer (CRC) is closely linked to the malignant transformation of colorectal polyps, making early detection essential. However, current models struggle with detecting small lesions, accurately localizing boundaries, and providing interpretable decisions. To address these issues, we propose HGNet, which integrates High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention. Key innovations include: (1) an Efficient Multi-Scale Context Attention (EMCA) module to enhance lesion feature representation and boundary modeling; (2) the deployment of a spatial hypergraph convolution module before the detection head to capture higher-order spatial relationships between nodes; (3) the application of transfer learning to address the scarcity of medical image data; and (4) Eigen Class Activation Map (Eigen-CAM) for decision visualization. Experimental results show that HGNet achieves 94% accuracy, 90.6% recall, and 90% mAP@0.5, significantly improving small lesion differentiation and clinical interpretability. The source code will be made publicly available upon publication of this paper.
https://arxiv.org/abs/2507.04880
We explore transfer learning strategies for musical onset detection in the Afro-Brazilian Maracatu tradition, which features complex rhythmic patterns that challenge conventional models. We adapt two Temporal Convolutional Network architectures: one pre-trained for onset detection (intra-task) and another for beat tracking (inter-task). Using only 5-second annotated snippets per instrument, we fine-tune these models through layer-wise retraining strategies for five traditional percussion instruments. Our results demonstrate significant improvements over baseline performance, with F1 scores reaching up to 0.998 in the intra-task setting and improvements of over 50 percentage points in best-case scenarios. The cross-task adaptation proves particularly effective for time-keeping instruments, where onsets naturally align with beat positions. The optimal fine-tuning configuration varies by instrument, highlighting the importance of instrument-specific adaptation strategies. This approach addresses the challenges of underrepresented musical traditions, offering an efficient human-in-the-loop methodology that minimizes annotation effort while maximizing performance. Our findings contribute to more inclusive music information retrieval tools applicable beyond Western musical contexts.
https://arxiv.org/abs/2507.04858
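The F1 scores reported in the onset detection abstract above depend on matching detected onsets to annotated ones within a tolerance window. The sketch below assumes a 50 ms window and a simple greedy one-to-one matching on sorted times; standard evaluation toolkits may use an optimal matching instead, and the example onset times are invented.

```python
def onset_f1(reference, detected, tolerance=0.05):
    """F1 for onset times in seconds; each reference matches at most one detection."""
    reference, detected = sorted(reference), sorted(detected)
    i = j = hits = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # detection is too early for this or any later reference
        else:
            i += 1  # this reference onset was missed
    if hits == 0:
        return 0.0
    precision = hits / len(detected)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = [0.50, 1.00, 1.50, 2.00]        # hypothetical annotated onsets
detected = [0.52, 1.20, 1.49, 2.03, 2.60]   # hypothetical model output
score = onset_f1(reference, detected)       # 3 hits: precision 0.6, recall 0.75
```

Here three of five detections match three of four annotations, giving an F1 of 2/3.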
In this work, we propose a simple but effective channel pruning framework called Progressive Channel Pruning (PCP) to accelerate Convolutional Neural Networks (CNNs). In contrast to the existing channel pruning methods that prune channels only once per layer in a layer-by-layer fashion, our new progressive framework iteratively prunes a small number of channels from several selected layers, which consists of a three-step attempting-selecting-pruning pipeline in each iteration. In the attempting step, we attempt to prune a pre-defined number of channels from one layer by using any existing channel pruning methods and estimate the accuracy drop for this layer based on the labelled samples in the validation set. In the selecting step, based on the estimated accuracy drops for all layers, we propose a greedy strategy to automatically select a set of layers that will lead to less overall accuracy drop after pruning these layers. In the pruning step, we prune a small number of channels from these selected layers. We further extend our PCP framework to prune channels for the deep transfer learning methods like Domain Adversarial Neural Network (DANN), in which we effectively reduce the data distribution mismatch in the channel pruning process by using both labelled samples from the source domain and pseudo-labelled samples from the target domain. Our comprehensive experiments on two benchmark datasets demonstrate that our PCP framework outperforms the existing channel pruning approaches under both supervised learning and transfer learning settings.
https://arxiv.org/abs/2507.04792
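The attempting-selecting-pruning loop described in the PCP abstract above can be outlined as follows. This is a schematic sketch only: `estimate_drop` is a hypothetical callback standing in for "prune tentatively and measure the accuracy drop on validation data", the selection rule simply takes the layers with the smallest estimated drops, and the toy sensitivity values are invented.

```python
def progressive_prune(channels, estimate_drop, target_total, step=2, layers_per_iter=2):
    """Iteratively prune `step` channels from the least sensitive layers.

    channels: dict mapping layer name -> remaining channel count.
    estimate_drop(layer, n_channels, step): assumed accuracy-drop estimate.
    """
    channels = dict(channels)
    while sum(channels.values()) > target_total:
        # Attempting: estimate the accuracy drop of pruning each eligible layer.
        drops = {
            layer: estimate_drop(layer, n, step)
            for layer, n in channels.items()
            if n > step
        }
        if not drops:
            break
        # Selecting: greedily choose the layers with the smallest estimated drop.
        selected = sorted(drops, key=drops.get)[:layers_per_iter]
        # Pruning: remove a small number of channels from the selected layers.
        for layer in selected:
            channels[layer] -= step
    return channels


sensitivity = {"conv1": 1.0, "conv2": 0.2, "conv3": 0.5}  # made-up values


def toy_drop(layer, n_channels, step):
    # Assumption: the drop grows with sensitivity and with the pruned fraction.
    return sensitivity[layer] * step / n_channels


pruned = progressive_prune({"conv1": 8, "conv2": 16, "conv3": 16}, toy_drop, target_total=32)
```

In this toy run the sensitive first layer is left untouched while the two more redundant layers shrink gradually over two iterations.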
Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can't fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.
https://arxiv.org/abs/2507.08013
Theoretical works on supervised transfer learning (STL) -- where the learner has access to labeled samples from both source and target distributions -- have for the most part focused on statistical aspects of the problem, while efficient optimization has received less attention. We consider the problem of designing an SGD procedure for STL that alternates sampling between source and target data, while maintaining statistical transfer guarantees without prior knowledge of the quality of the source data. A main algorithmic difficulty is in understanding how to design such an adaptive sub-sampling mechanism at each SGD step, to automatically gain from the source when it is informative, or bias towards the target and avoid negative transfer when the source is less informative. We show that such a mixed-sample SGD procedure is feasible for general prediction tasks with convex losses, rooted in tracking an abstract sequence of constrained convex programs that serve to maintain the desired transfer guarantees. We instantiate these results in the concrete setting of linear regression with square loss, and show that the procedure converges, with $1/\sqrt{T}$ rate, to a solution whose statistical performance on the target is adaptive to the a priori unknown quality of the source. Experiments with synthetic and real datasets support the theory.
https://arxiv.org/abs/2507.04194
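To make the setting of the abstract above concrete, the sketch below runs SGD on one-dimensional linear regression with square loss, drawing each step's sample from either the source or the target pool. It deliberately uses a fixed mixing probability `p_source`, which is a simplification: the paper's contribution is an adaptive sub-sampling mechanism, and that mechanism is not reproduced here. The step size `lr0 / sqrt(t)`, the data, and all names are assumptions.

```python
import random


def mixed_sample_sgd(source, target, steps, p_source=0.5, lr0=0.5, seed=0):
    """SGD on 0.5 * (w*x - y)^2, sampling from source or target at each step."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(1, steps + 1):
        # Fixed-probability choice of pool (the paper adapts this; we do not).
        pool = source if rng.random() < p_source else target
        x, y = rng.choice(pool)
        grad = (w * x - y) * x         # gradient of the square loss in w
        w -= lr0 / t ** 0.5 * grad     # decaying step size lr0 / sqrt(t)
    return w


# Noise-free toy data where the source is fully informative (same slope of 2).
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(50)]
source = [(x, 2.0 * x) for x in xs]
target = [(x, 2.0 * x) for x in xs[:10]]

w = mixed_sample_sgd(source, target, steps=5000)
```

When the source slope differs from the target's, a fixed `p_source` biases the solution toward the source, which is the negative-transfer risk that an adaptive scheme is meant to avoid.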
The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.
https://arxiv.org/abs/2507.03739
In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at this https URL to facilitate further research in the development of ViFMs.
https://arxiv.org/abs/2507.03578
Transfer learning has become an essential paradigm in artificial intelligence, enabling the transfer of knowledge from a source task to improve performance on a target task. This approach, particularly through techniques such as pretraining and fine-tuning, has seen significant success in fields like computer vision and natural language processing. However, despite its widespread use, how to reliably assess the transferability of knowledge remains a challenge. Understanding the theoretical underpinnings of each transferability metric is critical for ensuring the success of transfer learning. In this survey, we provide a unified taxonomy of transferability metrics, categorizing them based on transferable knowledge types and measurement granularity. This work examines the various metrics developed to evaluate the potential of source knowledge for transfer learning and their applicability across different learning paradigms, emphasizing the need for careful selection of these metrics. By offering insights into how different metrics work under varying conditions, this survey aims to guide researchers and practitioners in selecting the most appropriate metric for specific applications, contributing to more efficient, reliable, and trustworthy AI systems. Finally, we discuss some open challenges in this field and propose future research directions to further advance the application of transferability metrics in trustworthy transfer learning.
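Transfer learning has become an important paradigm in artificial intelligence, improving performance by transferring knowledge from a source task to a target task. The approach has been notably successful in fields such as computer vision and natural language processing, particularly through pretraining and fine-tuning. Yet despite its widespread use, reliably assessing the transferability of knowledge remains a challenge. Understanding the theoretical basis of each transferability metric is critical to successful transfer learning. In this survey, we provide a unified taxonomy of transferability metrics, categorized by the type of transferable knowledge and the granularity of measurement. We review the metrics developed to evaluate the potential of source knowledge for transfer learning, examine their applicability across learning paradigms, and emphasize the need to select them carefully. By explaining how different metrics behave under varying conditions, this survey aims to guide researchers and practitioners in choosing the most appropriate metric for a given application, contributing to more efficient, reliable, and trustworthy AI systems. Finally, we discuss open challenges and propose future research directions to further advance the use of transferability metrics in trustworthy transfer learning.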
https://arxiv.org/abs/2507.03175
Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervised features across all downstream tasks. The code is available at this https URL.
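Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. Using temporal image pairs from camera trap data, we automatically extract two distinct views of each individual without any supervision. These image pairs make it possible to train a self-supervised model from a potentially endless stream of video data. We evaluate the learned representations against supervised features in open-world scenarios and in transfer learning on various wildlife downstream tasks. The experimental results show that self-supervised models are more robust even with limited data, and that self-supervised features outperform supervised ones across all downstream tasks. The code is available at this https URL.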
https://arxiv.org/abs/2507.02403
Transfer learning in Reinforcement Learning (RL) enables agents to leverage knowledge from source tasks to accelerate learning in target tasks. While prior work, such as the Attend, Adapt, and Transfer (A2T) framework, addresses negative transfer and selective transfer, other critical challenges remain underexplored. This paper introduces the Generalized Adaptive Transfer Network (GATN), a deep RL architecture designed to tackle task generalization across domains, robustness to environmental changes, and computational efficiency in transfer. GATN employs a domain-agnostic representation module, a robustness-aware policy adapter, and an efficient transfer scheduler to achieve these goals. We evaluate GATN on diverse benchmarks, including Atari 2600, MuJoCo, and a custom chatbot dialogue environment, demonstrating superior performance in cross-domain generalization, resilience to dynamic environments, and reduced computational overhead compared to baselines. Our findings suggest GATN is a versatile framework for real-world RL applications, such as adaptive chatbots and robotic control.
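In Reinforcement Learning (RL), transfer learning enables agents to leverage knowledge from source tasks to accelerate learning in target tasks. Although prior work such as the Attend, Adapt, and Transfer (A2T) framework addresses negative transfer and selective transfer, other critical challenges remain underexplored. This paper introduces the Generalized Adaptive Transfer Network (GATN), a deep RL architecture designed to address task generalization across domains, robustness to environmental changes, and computational efficiency in transfer. GATN employs a domain-agnostic representation module, a robustness-aware policy adapter, and an efficient transfer scheduler to achieve these goals. We evaluate GATN on diverse benchmarks, including Atari 2600, MuJoCo, and a custom chatbot dialogue environment, where it outperforms baselines in cross-domain generalization, resilience to dynamic environments, and computational overhead. Our findings suggest that GATN is a versatile framework for real-world RL applications such as adaptive chatbots and robotic control.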
https://arxiv.org/abs/2507.03026