Underwater acoustic target recognition (UATR) is of great significance for the protection of marine biodiversity and for national defense security. The development of deep learning provides new opportunities for UATR, but the task still faces challenges posed by the scarcity of reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy and a 95% F1-score in a 27-class few-shot scenario, significantly outperforming traditional CNN and ACNN models as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weight adjustment strategy effectively balances the task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.
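A minimal sketch of the two core ingredients, a channel-attention block over spectrogram features and a jointly weighted classification/reconstruction objective (the squeeze-and-excitation form of the attention, the layer sizes, and the weighting rule are illustrative assumptions, not the authors' exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Reweights feature channels so that discriminative ones
    (e.g. harmonic bands) are amplified and noisy ones suppressed."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                               # x: (B, C, H, W)
        return x * self.gate(x).view(x.size(0), -1, 1, 1)

class MTBCACNNSketch(nn.Module):
    """Shared feature extractor with two heads: 27-way classification
    and reconstruction of the input spectrogram."""
    def __init__(self, channels=32, num_classes=27):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            ChannelAttention(channels),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_classes))
        self.reconstructor = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.reconstructor(z)

def multitask_loss(logits, recon, labels, spectrogram, alpha=0.5):
    # alpha is the task-balancing weight; the paper adjusts it dynamically.
    return alpha * F.cross_entropy(logits, labels) + \
           (1 - alpha) * F.mse_loss(recon, spectrogram)
```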
https://arxiv.org/abs/2504.13102
Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to overfitting and limited generalization. To adapt more efficiently to ground-object distributions and extract image features while skipping redundant information and without introducing excessive parameters, this paper proposes EKGNet, built on an improved 3D-DenseNet model and consisting of a context-aware mapping network and a dynamic kernel generation module. The context-aware mapping module translates global contextual information of the hyperspectral input into instructions for combining base convolutional kernels, while the dynamic kernels are composed of K groups of base convolutions, analogous to K experts each specializing in fundamental patterns along different dimensions. The mapping module and the dynamic kernel generation mechanism form a tightly coupled system: the former generates meaningful combination weights from the input, while the latter uses these weights to construct an adaptive expert convolution system. This dynamic approach enables the model to focus more flexibly on key spatial structures when processing different regions, rather than relying on the fixed receptive field of a single static convolutional kernel. EKGNet enhances representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
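A minimal sketch of the dynamic expert-kernel idea, shown in 2D for brevity (the number of experts K, the layer sizes, and the form of the context-aware mapping network are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicExpertConv(nn.Module):
    """Holds K base ('expert') kernels; a context-aware mapping network turns
    the global context of each input into mixing weights, and the effective
    kernel applied to that input is the weighted sum of the experts."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(
            0.02 * torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size))
        self.mapping = nn.Sequential(            # global context -> expert weights
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_experts), nn.Softmax(dim=-1))
        self.padding = kernel_size // 2

    def forward(self, x):                        # x: (B, C_in, H, W)
        mix = self.mapping(x)                    # (B, K) combination weights
        outputs = []
        for i in range(x.size(0)):               # one adaptive kernel per sample
            kernel = torch.einsum("k,koihw->oihw", mix[i], self.experts)
            outputs.append(F.conv2d(x[i:i + 1], kernel, padding=self.padding))
        return torch.cat(outputs, dim=0)
```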
https://arxiv.org/abs/2504.13045
Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground-object interpretation. Existing methods face three key challenges: (1) difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) limited prior information causing semantic inconsistency in reconstructions, and (3) an imbalanced trade-off between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: first, a Multi-scale Feature Aggregation Block (MFAB) that employs parallel heterogeneous convolutional kernels for multi-scale feature extraction; second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes; third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction with noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving a 1.43% LPIPS improvement and a 3.67% FID improvement over the best-performing baselines. Code/model: this https URL.
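A minimal sketch of a multi-scale feature aggregation block built from parallel heterogeneous kernels (the kernel sizes and the residual fusion are illustrative assumptions, not the exact MFAB design):

```python
import torch
import torch.nn as nn

class MultiScaleFeatureAggregation(nn.Module):
    """Parallel convolution branches with different kernel sizes capture
    structures at several scales; their outputs are concatenated and fused."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes)
        self.fuse = nn.Conv2d(len(kernel_sizes) * channels, channels, 1)

    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multi_scale)        # residual aggregation
```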
https://arxiv.org/abs/2504.13026
This study presents an ensemble-based approach for cocoa pod disease classification by integrating transfer learning with three ensemble learning strategies: Bagging, Boosting, and Stacking. Pre-trained convolutional neural networks, including VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception, were fine-tuned and employed as base learners to detect three disease categories: Black Pod Rot, Pod Borer, and Healthy. A balanced dataset of 6,000 cocoa pod images was curated and augmented to ensure robustness against variations in lighting, orientation, and disease severity. The performance of each ensemble method was evaluated using accuracy, precision, recall, and F1-score. Experimental results show that Bagging consistently achieved superior classification performance with a test accuracy of 100%, outperforming Boosting (97%) and Stacking (92%). The findings confirm that combining transfer learning with ensemble techniques improves model generalization and reliability, making it a promising direction for precision agriculture and automated crop disease management.
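A minimal sketch of the bagging strategy: each fine-tuned base learner is trained on a bootstrap resample and predictions are combined by soft voting (the resampling and voting details here are illustrative, not the study's exact protocol):

```python
import numpy as np
import torch

def bootstrap_indices(n_samples, n_bags, seed=0):
    """One with-replacement resample of the training indices per base learner."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_bags)]

def bagged_predict(models, images):
    """Soft voting: average the softmax outputs of the fine-tuned CNNs
    (e.g. VGG16, ResNet50, Xception) and take the most probable class."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(images), dim=1) for m in models])
    return probs.mean(dim=0).argmax(dim=1)
```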
https://arxiv.org/abs/2504.12992
Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, utilizing that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
https://arxiv.org/abs/2504.12939
This paper introduces AdaptoVision, a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements while preserving competitive performance across various benchmark and medical image datasets. Extensive experimentation demonstrates that AdaptoVision achieves state-of-the-art results on the BreakHis dataset and comparable accuracy levels elsewhere, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights. The model's streamlined architecture and strategic simplifications promote effective feature extraction and robust generalization, making it particularly suitable for deployment in real-time and resource-constrained environments.
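A minimal sketch of a depth-wise separable residual unit of the kind described above (channel counts and normalization choices are illustrative assumptions):

```python
import torch.nn as nn

class SeparableResidualUnit(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution,
    wrapped in a skip connection: far fewer parameters than a dense 3x3 conv."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.Conv2d(channels, channels, 1),                              # pointwise
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)
```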
https://arxiv.org/abs/2504.12652
The application of artificial intelligence (AI) in medical imaging has revolutionized diagnostic practices, enabling advanced analysis and interpretation of radiological data. This study presents a comprehensive evaluation of radiomics-based and deep learning-based approaches for disease detection in chest radiography, focusing on COVID-19, lung opacity, and viral pneumonia. While deep learning models, particularly convolutional neural networks (CNNs) and vision transformers (ViTs), learn directly from image data, radiomics-based models extract and analyze quantitative features, potentially providing advantages in data-limited scenarios. This study systematically compares the diagnostic accuracy and robustness of various AI models, including Decision Trees, Gradient Boosting, Random Forests, Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP) for radiomics, against state-of-the-art computer vision deep learning architectures. Performance metrics across varying sample sizes reveal insights into each model's efficacy, highlighting the contexts in which specific AI approaches may offer enhanced diagnostic capabilities. The results aim to inform the integration of AI-driven diagnostic tools in clinical practice, particularly in automated and high-throughput environments where timely, reliable diagnosis is critical. This comparative study addresses an essential gap, establishing guidance for the selection of AI models based on clinical and operational needs.
https://arxiv.org/abs/2504.12249
Deep neural networks (DNNs) have recently become the leading method for low-light image enhancement (LLIE). However, despite significant progress, their outputs may still exhibit issues such as amplified noise, incorrect white balance, or unnatural enhancements when deployed in real-world applications. A key challenge is the lack of diverse, large-scale training data that captures the complexities of low-light conditions and imaging pipelines. In this paper, we propose a novel image signal processing (ISP)-driven data synthesis pipeline that addresses these challenges by generating unlimited paired training data. Specifically, our pipeline begins with easily collected high-quality normal-light images, which are first unprocessed into the RAW format using a reverse ISP. We then synthesize low-light degradations directly in the RAW domain. The resulting data is subsequently processed through a series of ISP stages, including white balance adjustment, color space conversion, tone mapping, and gamma correction, with controlled variations introduced at each stage. This broadens the degradation space and enhances the diversity of the training data, enabling the generated data to capture a wide range of degradations and the complexities inherent in the ISP pipeline. To demonstrate the effectiveness of our synthetic pipeline, we conduct extensive experiments using a vanilla UNet model consisting solely of convolutional layers, group normalization, GeLU activation, and convolutional block attention modules (CBAMs). Extensive testing across multiple datasets reveals that the vanilla UNet model trained with our data synthesis pipeline delivers high-fidelity, visually appealing enhancement results, surpassing state-of-the-art (SOTA) methods both quantitatively and qualitatively.
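A minimal NumPy sketch of the synthesis idea: unprocess an sRGB image to a pseudo-RAW signal, apply low-light degradation there, then re-run simplified ISP stages with randomized parameters (the individual transforms and parameter ranges are simplified assumptions, not the paper's exact pipeline):

```python
import numpy as np

def synthesize_low_light_pair(srgb, rng=None):
    """srgb: float32 image in [0, 1], shape (H, W, 3). Returns a degraded
    low-light version to be paired with the original for training."""
    rng = rng or np.random.default_rng()
    # Reverse ISP (simplified): undo gamma to reach a linear, pseudo-RAW domain.
    raw = np.clip(srgb, 0.0, 1.0) ** 2.2
    # Low-light degradation in the RAW domain: exposure drop + shot/read noise.
    raw = raw * rng.uniform(0.05, 0.3)
    raw = rng.poisson(raw * 255.0) / 255.0 + rng.normal(0.0, 0.01, raw.shape)
    # Forward ISP stages with controlled variation at each stage.
    raw = np.clip(raw * rng.uniform(0.8, 1.2, size=3), 0.0, 1.0)   # white balance
    raw = raw / (raw + 0.25)                                       # tone mapping
    low_light = np.clip(raw, 0.0, 1.0) ** (1.0 / rng.uniform(2.0, 2.4))  # gamma
    return low_light.astype(np.float32)
```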
https://arxiv.org/abs/2504.12204
Radar-based human activity recognition (HAR) has emerged as a promising alternative to conventional monitoring approaches, such as wearable devices and camera-based systems, due to its unique privacy-preservation and robustness advantages. However, existing solutions based on convolutional and recurrent neural networks, although effective, are computationally demanding during deployment. This limits their applicability in scenarios with constrained resources or those requiring multiple sensors. Advanced architectures, such as Vision Transformers (ViTs) and state-space models (SSMs), offer improved modeling capabilities and have moved toward lightweight designs, but their computational complexity remains relatively high. To leverage the strengths of transformer architectures while simultaneously enhancing accuracy and reducing computational complexity, this paper introduces RadMamba, a parameter-efficient, radar micro-Doppler-oriented Mamba SSM specifically tailored for radar-based HAR. Across three diverse datasets, RadMamba matches the top-performing previous model's 99.8% classification accuracy on the DIAT dataset with only 1/400 of its parameters and equals the leading models' 92.0% accuracy on the CI4R dataset with merely 1/10 of their parameters. In scenarios with continuous sequences of actions, evaluated on the UoG2020 dataset, RadMamba surpasses other models with significantly higher parameter counts by at least 3%, achieving this with only 6.7k parameters. Our code is available at: this https URL.
https://arxiv.org/abs/2504.12039
The widespread integration of microphones into devices increases the opportunities for Acoustic Side-Channel Attacks (ASCAs), as these can be used to capture keystroke audio signals that might reveal sensitive information. However, the current State-Of-The-Art (SOTA) models for ASCAs, including Convolutional Neural Networks (CNNs) and hybrid models such as CoAtNet, still exhibit limited robustness under realistic noisy conditions. Solving this problem requires either: (i) an increased model capacity to infer contextual information from longer sequences, allowing the model to learn that an initially noisily typed word is the same as a noise-free word collected later, or (ii) an approach to fix misidentified information from the context, since one does not type random words but the ones that best fit the conversation context. In this paper, we demonstrate that both strategies are viable and complementary solutions for making ASCAs practical. We observed that no existing solution leverages the power of advanced transformer architectures for these tasks and propose that: (i) Visual Transformers (VTs) are the candidate solution for capturing long-term contextual information and (ii) transformer-powered Large Language Models (LLMs) are the candidate solution for fixing the "typos" (mispredictions) the model might make. Thus, we here present a first-of-its-kind approach that integrates VTs and LLMs for ASCAs. We first show that VTs achieve SOTA performance in classifying keystrokes when compared to the previous CNN benchmark. Second, we demonstrate that LLMs can mitigate the impact of real-world noise. Evaluations on natural sentences revealed that: (i) incorporating LLMs (e.g., GPT-4o) in our ASCA pipeline boosts the performance of error-correction tasks; and (ii) comparable performance can be attained by a lightweight, fine-tuned smaller LLM (67 times smaller than GPT-4o), using...
https://arxiv.org/abs/2504.11622
Deep learning has been reported to achieve high performance in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a study on Alzheimer's disease [28] that examined the robustness of logistic regression (LR) and convolutional neural networks (CNN) across patient sexes. We explore sex bias in skin cancer detection using the PAD-UFES-20 dataset, with an LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist) and a pre-trained ResNet-50 model. We evaluate these models in alignment with [28]: across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distributions, but they also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients than for female patients. We hope these findings contribute to the growing field of investigating potential bias in popular medical machine learning methods. The data and relevant scripts to reproduce our results can be found on our GitHub.
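A minimal sketch of the per-sex evaluation used to probe bias, computing ACC and AUROC separately for male and female patients (the decision threshold and group labels are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_by_sex(y_true, y_score, sex, threshold=0.5):
    """y_true: binary labels; y_score: predicted malignancy probability;
    sex: per-patient labels such as 'M'/'F'. Returns metrics per group."""
    y_true, y_score, sex = map(np.asarray, (y_true, y_score, sex))
    results = {}
    for group in np.unique(sex):
        mask = sex == group
        results[group] = {
            "ACC": accuracy_score(y_true[mask], y_score[mask] >= threshold),
            "AUROC": roc_auc_score(y_true[mask], y_score[mask]),
        }
    return results
```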
https://arxiv.org/abs/2504.11415
The necessity of abundant annotated data and complex network architectures presents a significant challenge in deep-learning Salient Object Detection (deep SOD) and across the broader deep-learning landscape. This challenge is particularly acute in medical applications in developing countries with limited computational resources. Combining modern and classical techniques offers a path to maintaining competitive performance while enabling practical applications. Feature Learning from Image Markers (FLIM) methodology empowers experts to design convolutional encoders through user-drawn markers, with filters learned directly from these annotations. Recent findings demonstrate that coupling a FLIM encoder with an adaptive decoder creates a flyweight network suitable for SOD, requiring significantly fewer parameters than lightweight models and eliminating the need for backpropagation. Cellular Automata (CA) methods have proven successful in data-scarce scenarios but require proper initialization -- typically through user input, priors, or randomness. We propose a practical intersection of these approaches: using FLIM networks to initialize CA states with expert knowledge without requiring user interaction for each image. By decoding features from each level of a FLIM network, we can initialize multiple CAs simultaneously, creating a multi-level framework. Our method leverages the hierarchical knowledge encoded across different network layers, merging multiple saliency maps into a high-quality final output that functions as a CA ensemble. Benchmarks across two challenging medical datasets demonstrate the competitiveness of our multi-level CA approach compared to established models in the deep SOD literature.
https://arxiv.org/abs/2504.11406
This paper introduces ConvShareViT, a novel deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system. ConvShareViT replaces the linear layers in multi-head self-attention (MHSA) and the multilayer perceptrons (MLPs) with a depthwise convolutional layer whose weights are shared across input channels. Through the development of ConvShareViT, the behaviour of convolutions within MHSA and their effectiveness in learning the attention mechanism were analysed systematically. Experimental results demonstrate that certain configurations, particularly those using valid-padded shared convolutions, can successfully learn attention, achieving attention scores comparable to those obtained with standard ViTs. However, other configurations, such as those using same-padded convolutions, show limitations in attention learning and behave like regular CNNs rather than transformer models. ConvShareViT architectures are specifically optimised for the 4f optical system, which takes advantage of the parallelism and high-resolution capabilities of optical systems. Results show that ConvShareViT can theoretically achieve up to 3.04 times faster inference than GPU-based systems. This potential acceleration makes ConvShareViT an attractive candidate for future optical deep learning applications and demonstrates that a ViT can be implemented using only the convolution operation, provided the architecture is appropriately optimised to balance performance and complexity.
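A minimal sketch of the weight-sharing idea: a depthwise convolution in which a single kernel is shared by all input channels, the analogue of one optical mask applied in parallel to every channel in a 4f system (shapes, initialization, and the default valid padding are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDepthwiseConv(nn.Module):
    """Every input channel is filtered by the same learnable kernel;
    valid padding by default, echoing the configurations reported to
    learn attention successfully."""
    def __init__(self, kernel_size=3, padding=0):
        super().__init__()
        self.kernel = nn.Parameter(0.02 * torch.randn(1, 1, kernel_size, kernel_size))
        self.padding = padding

    def forward(self, x):                                  # x: (B, C, H, W)
        channels = x.size(1)
        weight = self.kernel.expand(channels, 1, -1, -1)   # shared across channels
        return F.conv2d(x, weight, padding=self.padding, groups=channels)
```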
https://arxiv.org/abs/2504.11517
This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge, powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we apply the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode the motion information of the driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization, seamlessly upscaling to 720p (1280x720) during inference. The training and inference code is publicly available at this https URL.
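A minimal sketch of a lightweight pose encoder made of stacked 3D convolutions over a (B, C, T, H, W) pose sequence (channel widths, strides, and the activation are illustrative assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Encodes a sequence of driving-pose frames into spatio-temporal
    motion features via a few stacked 3D convolutions."""
    def __init__(self, in_ch=3, width=64, out_ch=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(width, width * 2, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(width * 2, out_ch, 3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, poses):            # poses: (B, 3, T, H, W)
        return self.net(poses)

# Example: 16 pose frames at 96x96 -> (1, 320, 16, 12, 12) motion features
motion = PoseEncoder()(torch.randn(1, 3, 16, 96, 96))
```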
https://arxiv.org/abs/2504.11289
Convolutional neural networks (CNNs) have been widely used in efficient image super-resolution. However, for CNN-based methods, performance gains often require deeper networks and larger feature maps, which increase complexity and inference costs. Inspired by LoRA's success in fine-tuning large language models, we explore its application to lightweight models and propose Distillation-Supervised Convolutional Low-Rank Adaptation (DSCLoRA), which improves model performance without increasing architectural complexity or inference costs. Specifically, we integrate ConvLoRA into the efficient SR network SPAN by replacing the SPAB module with the proposed SConvLB module and incorporating ConvLoRA layers into both the pixel shuffle block and its preceding convolutional layer. DSCLoRA leverages low-rank decomposition for parameter updates and employs a spatial feature affinity-based knowledge distillation strategy to transfer second-order statistical information from teacher models (pre-trained SPAN) to student models (ours). This method preserves the core knowledge of lightweight models and facilitates optimal solution discovery under certain conditions. Experiments on benchmark datasets show that DSCLoRA improves PSNR and SSIM over SPAN while maintaining its efficiency and competitive image quality. Notably, DSCLoRA ranked first in the Overall Performance Track of the NTIRE 2025 Efficient Super-Resolution Challenge. Our code and models are made publicly available at this https URL.
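A minimal sketch of a ConvLoRA-style layer: the pretrained convolution stays frozen and a trainable low-rank down/up convolution pair is added in parallel (the rank, scaling, and zero initialization are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ConvLoRA(nn.Module):
    """y = base(x) + scale * up(down(x)), with rank r much smaller than the
    channel count, so only a few extra parameters are trained."""
    def __init__(self, base_conv: nn.Conv2d, rank=4, scale=1.0):
        super().__init__()
        self.base = base_conv
        for p in self.base.parameters():
            p.requires_grad = False                    # keep pretrained weights fixed
        self.down = nn.Conv2d(base_conv.in_channels, rank, base_conv.kernel_size,
                              stride=base_conv.stride, padding=base_conv.padding,
                              bias=False)
        self.up = nn.Conv2d(rank, base_conv.out_channels, 1, bias=False)
        nn.init.zeros_(self.up.weight)                 # start as an identity update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap a 3x3 convolution from a pretrained SR network
layer = ConvLoRA(nn.Conv2d(48, 48, 3, padding=1), rank=4)
```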
https://arxiv.org/abs/2504.11271
With the rapid development of information technology, modern warfare increasingly relies on intelligence, making small target detection critical in military applications. The growing demand for efficient, real-time detection has created challenges in identifying small targets in complex environments due to interference. To address this, we propose a small target detection method based on multi-modal image fusion and attention mechanisms. This method leverages YOLOv5, integrating infrared and visible light data along with a convolutional attention module to enhance detection performance. The process begins with multi-modal dataset registration using feature point matching, ensuring accurate network training. By combining infrared and visible light features with attention mechanisms, the model improves detection accuracy and robustness. Experimental results on anti-UAV and Visdrone datasets demonstrate the effectiveness and practicality of our approach, achieving superior detection results for small and dim targets.
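A minimal sketch of convolutional-attention fusion of registered infrared and visible feature maps (the CBAM-style channel/spatial attention and where it plugs into YOLOv5 are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Concatenates infrared and visible feature maps, mixes them with a 1x1
    convolution, then reweights channels and spatial positions."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, 1)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, feat_ir, feat_vis):
        fused = self.mix(torch.cat([feat_ir, feat_vis], dim=1))
        fused = fused * self.channel_att(fused)   # emphasise informative channels
        return fused * self.spatial_att(fused)    # emphasise informative locations
```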
https://arxiv.org/abs/2504.11262
Predicting personality traits automatically has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips. For visual processing, we construct a facial graph and design a Geo-based two-stream network incorporating an attention mechanism, leveraging both Graph Convolutional Networks (GCN) and Convolutional Neural Networks (CNN) to capture static facial expressions. Additionally, ResNet18 and VGGFace networks are employed to extract global scene and facial appearance features at the frame level. To capture dynamic temporal information, we integrate a BiGRU with a temporal attention module for extracting salient frame representations. To enhance the model's robustness, we incorporate the VGGish CNN for audio-based features and XLM-Roberta for text-based features. Finally, a multimodal channel attention mechanism is introduced to integrate different modalities, and a Multi-Layer Perceptron (MLP) regression model is used to predict personality traits. Experimental results confirm that our proposed framework surpasses existing state-of-the-art approaches in performance.
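A minimal sketch of the temporal branch: a BiGRU over per-frame features followed by a temporal attention module that pools salient frames into a clip vector (feature and hidden sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """BiGRU over per-frame features, then a learned attention weighting
    selects salient frames and pools them into one clip-level embedding."""
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, frame_feats):                    # (B, T, feat_dim)
        h, _ = self.gru(frame_feats)                   # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        return (weights * h).sum(dim=1)                # (B, 2*hidden) clip embedding
```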
https://arxiv.org/abs/2504.11515
Salient Object Detection (SOD) with deep learning often requires substantial computational resources and large annotated datasets, making it impractical for resource-constrained applications. Lightweight models address the computational demands but typically struggle in complex scenarios with scarce labeled data. Feature Learning from Image Markers (FLIM) learns an encoder's convolutional kernels from image patches extracted from discriminative regions marked on a few representative images, dispensing with large annotated datasets, pretraining, and backpropagation. Such a methodology exploits the information redundancy commonly found in biomedical image applications. This study presents methods to learn dilated-separable convolutional kernels and multi-dilation layers without backpropagation for FLIM networks. It also proposes a novel network simplification method to reduce kernel redundancy and encoder size. By combining a FLIM encoder with an adaptive decoder, a concept recently introduced to estimate a pointwise convolution per image, this study presents very efficient (named flyweight) SOD models for biomedical images. Experimental results on challenging datasets demonstrate superior efficiency and effectiveness compared to lightweight models. While requiring significantly fewer parameters and floating-point operations, the models achieve effectiveness competitive with heavyweight models. These advances highlight the potential of FLIM networks for data-limited and resource-constrained applications with information redundancy.
https://arxiv.org/abs/2504.11112
The purpose of this paper is to explore the use of underwater image enhancement techniques to improve keypoint detection and matching. By applying advanced deep learning models, including generative adversarial networks and convolutional neural networks, we aim to find the best method which improves the accuracy of keypoint detection and the robustness of matching algorithms. We evaluate the performance of these techniques on various underwater datasets, demonstrating significant improvements over traditional methods.
https://arxiv.org/abs/2504.11063
Background: Deep learning has significantly advanced medical image analysis, with Vision Transformers (ViTs) offering a powerful alternative to convolutional models by modeling long-range dependencies through self-attention. However, ViTs are inherently data-intensive and lack domain-specific inductive biases, limiting their applicability in medical imaging. In contrast, radiomics provides interpretable, handcrafted descriptors of tissue heterogeneity but suffers from limited scalability and integration into end-to-end learning frameworks. In this work, we propose the Radiomics-Embedded Vision Transformer (RE-ViT) that combines radiomic features with data-driven visual embeddings within a ViT backbone. Purpose: To develop a hybrid RE-ViT framework that integrates radiomics and patch-wise ViT embeddings through early fusion, enhancing robustness and performance in medical image classification. Methods: Following the standard ViT pipeline, images were divided into patches. For each patch, handcrafted radiomic features were extracted and fused with linearly projected pixel embeddings. The fused representations were normalized, positionally encoded, and passed to the ViT encoder. A learnable [CLS] token aggregated patch-level information for classification. We evaluated RE-ViT on three public datasets (including BUSI, ChestXray2017, and Retinal OCT) using accuracy, macro AUC, sensitivity, and specificity. RE-ViT was benchmarked against CNN-based (VGG-16, ResNet) and hybrid (TransMed) models. Results: RE-ViT achieved state-of-the-art results: on BUSI, AUC=0.950+/-0.011; on ChestXray2017, AUC=0.989+/-0.004; on Retinal OCT, AUC=0.986+/-0.001, which outperforms other comparison models. Conclusions: The RE-ViT framework effectively integrates radiomics with ViT architectures, demonstrating improved performance and generalizability across multimodal medical image classification tasks.
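A minimal sketch of the early-fusion step: per-patch radiomic descriptors are projected into the embedding space and combined with the linear pixel embedding before the [CLS] token and positional encoding are added (dimensions and the additive fusion operator are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RadiomicsPatchEmbed(nn.Module):
    """Fuses handcrafted per-patch radiomic features with learned pixel
    embeddings, then prepends a [CLS] token and adds positional encodings,
    producing the token sequence consumed by a standard ViT encoder."""
    def __init__(self, patch_pixels, radiomic_dim, embed_dim=256, num_patches=196):
        super().__init__()
        self.pixel_proj = nn.Linear(patch_pixels, embed_dim)
        self.radiomic_proj = nn.Linear(radiomic_dim, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patches, radiomics):
        # patches: (B, N, patch_pixels); radiomics: (B, N, radiomic_dim)
        tokens = self.norm(self.pixel_proj(patches) + self.radiomic_proj(radiomics))
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```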
https://arxiv.org/abs/2504.10916