Brain tumor classification from magnetic resonance imaging (MRI) plays a critical role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy, but their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. First, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen's d, and p-values, are used to rank and select the most discriminative components. A lightweight CNN classifier is then trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced: a forward noising step followed by a learned denoiser network is applied before classification. System performance is evaluated using both clean accuracy and robust accuracy under strong adversarial attacks generated by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial attacks. These findings suggest that combining interpretable NNMF-based representations with a lightweight deep classifier and a diffusion-based defense technique supplies an effective and reliable solution for medical image classification under adversarial conditions.
https://arxiv.org/abs/2603.13182
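The NNMF-plus-ranking stage above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it runs multiplicative-update NMF on a toy non-negative matrix and ranks the resulting components by a rank-based (Mann-Whitney) AUC against a binary label; the synthetic data and all names are placeholders.

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9, seed=0):
    """Plain multiplicative-update NMF: V (n x d) ~ W (n x k) @ H (k x d)."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, d)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def auc_per_component(W, y):
    """Rank-based (Mann-Whitney) AUC of each NMF coefficient vs. binary label y."""
    aucs = []
    for j in range(W.shape[1]):
        pos, neg = W[y == 1, j], W[y == 0, j]
        # P(positive coefficient > negative coefficient), ties counted as 1/2
        greater = (pos[:, None] > neg[None, :]).sum() \
            + 0.5 * (pos[:, None] == neg[None, :]).sum()
        aucs.append(greater / (len(pos) * len(neg)))
    return np.array(aucs)

# Tiny synthetic "images": class 1 has extra energy in the first half of features.
rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
V = rng.random((40, 16))
V[y == 1, :8] += 2.0
W, H = nmf(V, k=4)
ranking = np.argsort(-auc_per_component(W, y))  # most discriminative components first
```

In the paper the ranking additionally uses Cohen's d and p-values; AUC alone is shown here for brevity.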
Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, making long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate MoHETS across seven multivariate benchmarks and multiple horizons; it consistently achieves state-of-the-art performance, reducing the average MSE by $12\%$ compared to strong recent baselines and demonstrating effective heterogeneous specialization for long-term forecasting.
https://arxiv.org/abs/2601.21866
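The heterogeneous routing idea can be illustrated with a toy sketch. Assuming a softmax router, a shared smoothing filter standing in for the depthwise-convolution expert, and FFT-truncation experts standing in for the Fourier experts (all deliberate simplifications of the paper's modules), one MoHE-style layer might look like:

```python
import numpy as np

def fourier_expert(patch, keep=2):
    """Keep only the `keep` largest-magnitude frequency bins (crude periodic expert)."""
    spec = np.fft.rfft(patch)
    spec[np.argsort(np.abs(spec))[:-keep]] = 0
    return np.fft.irfft(spec, n=len(patch))

def conv_expert(patch, w=np.array([0.25, 0.5, 0.25])):
    """Shared smoothing expert, a stand-in for the depthwise-convolution expert."""
    return np.convolve(patch, w, mode="same")

def mohe_layer(patches, router_W, top_k=1):
    """Route each patch to its top-k routed experts, then add the shared expert."""
    experts = [lambda p: fourier_expert(p, keep=2),
               lambda p: fourier_expert(p, keep=4)]
    out = np.zeros_like(patches)
    for i, p in enumerate(patches):
        logits = router_W @ p
        gates = np.exp(logits - logits.max())
        gates /= gates.sum()
        top = np.argsort(-gates)[:top_k]          # sparse: only top-k experts run
        routed = sum(gates[j] * experts[j](p) for j in top)
        out[i] = conv_expert(p) + routed          # shared expert + routed experts
    return out

rng = np.random.default_rng(0)
patches = rng.standard_normal((6, 16))   # 6 temporal patches of length 16
router_W = rng.standard_normal((2, 16))  # one router logit per expert
y = mohe_layer(patches, router_W)
```

The real MoHE layer operates on learned patch embeddings inside a Transformer; this sketch only shows the dispatch-and-combine control flow.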
Regular monitoring of glycemic status is essential for diabetes management, yet conventional blood-based testing can be burdensome for frequent assessment. The sclera contains superficial microvasculature that may exhibit diabetes-related alterations and is readily visible on the ocular surface. We propose ScleraGluNet, a multiview deep-learning framework for three-class metabolic status classification (normal, controlled diabetes, and high-glucose diabetes) and continuous fasting plasma glucose (FPG) estimation from multidirectional scleral vessel images. The dataset comprised 445 participants (150/140/155) and 2,225 anterior-segment images acquired from five gaze directions per participant. After vascular enhancement, features were extracted using parallel convolutional branches, refined with Manta Ray Foraging Optimization (MRFO), and fused via transformer-based cross-view attention. Performance was evaluated using subject-wise five-fold cross-validation, with all images from each participant assigned to the same fold. ScleraGluNet achieved 93.8% overall accuracy, with one-vs-rest AUCs of 0.971, 0.956, and 0.982 for normal, controlled diabetes, and high-glucose diabetes, respectively. For FPG estimation, the model achieved MAE = 6.42 mg/dL and RMSE = 7.91 mg/dL, with strong correlation to laboratory measurements (r = 0.983; R^2 = 0.966). Bland-Altman analysis showed a mean bias of +1.45 mg/dL with 95% limits of agreement from -8.33 to +11.23 mg/dL. These results support multidirectional scleral vessel imaging with multiview learning as a promising noninvasive approach for glycemic assessment, warranting multicenter validation before clinical deployment.
https://arxiv.org/abs/2603.12715
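The Bland-Altman numbers quoted above follow the standard computation: the mean bias is the average paired difference, and the 95% limits of agreement are bias ± 1.96 times the standard deviation of the differences. A minimal sketch with made-up readings (not the study's data):

```python
import numpy as np

def bland_altman(pred, ref):
    """Mean bias and 95% limits of agreement between two paired measurements."""
    diff = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)             # sample standard deviation of differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Illustrative paired glucose readings (mg/dL); numbers are placeholders.
ref  = np.array([90.0, 110.0, 130.0, 150.0, 170.0])
pred = np.array([92.0, 111.0, 133.0, 149.0, 173.0])
bias, lo, hi = bland_altman(pred, ref)
```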
Semantic place categorization, one of the essential tasks for autonomous robots and vehicles, equips them with self-decision and navigation capabilities in unfamiliar environments. Outdoor places, in particular, are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method for categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO), comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor parking, outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach on the MPO dataset. Our results on the MPO dataset outperform traditional approaches and demonstrate the effectiveness of using both depth and reflectance modalities. To analyze the trained deep networks, we visualize the learned features.
https://arxiv.org/abs/2603.12663
Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: this https URL
https://arxiv.org/abs/2603.12624
Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.
https://arxiv.org/abs/2603.12547
We present an alternative way of solving the steerable kernel constraint that appears in the design of steerable equivariant convolutional neural networks. We find explicit real and complex bases which are ready to use, for different symmetry groups and for feature maps of arbitrary tensor type. A major advantage of this method is that it bypasses the need to numerically or analytically compute Clebsch-Gordan coefficients and works directly with the representations of the input and output feature maps. The strategy is to find a basis of kernels that respect a simpler invariance condition at some point $x_0$, and then \textit{steer} it with the defining equation of steerability to move to some arbitrary point $x=g\cdot x_0$. This idea has already been mentioned in the literature before, but not advanced in depth and with some generality. Here we describe how it works with minimal technical tools to make it accessible for a general audience.
https://arxiv.org/abs/2603.12459
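The strategy above can be stated in two standard equations of the steerable-CNN literature (this is the generic form of the constraint, not a result specific to this paper): a basis solving the pointwise invariance condition at a reference point $x_0$ is transported to any $x = g \cdot x_0$ by the steerability equation.

```latex
% Steerability constraint for a kernel between fields of types \rho_{in}, \rho_{out}:
k(g \cdot x) \;=\; \rho_{\mathrm{out}}(g)\, k(x)\, \rho_{\mathrm{in}}(g)^{-1},
\qquad \forall\, g \in G .
% Pointwise condition at x_0, with H \leq G the stabilizer of x_0:
k(x_0) \;=\; \rho_{\mathrm{out}}(h)\, k(x_0)\, \rho_{\mathrm{in}}(h)^{-1},
\qquad \forall\, h \in H ,
% and steering transports a basis solving it to an arbitrary point:
k(g \cdot x_0) \;=\; \rho_{\mathrm{out}}(g)\, k(x_0)\, \rho_{\mathrm{in}}(g)^{-1}.
```

Because the stabilizer $H$ is smaller than $G$, the pointwise condition is the easier linear problem; steering then extends its solutions to the whole domain.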
Convolutional Neural Networks have shown promising effectiveness in identifying different types of cancer from radiographs. However, the opaque nature of CNNs makes it difficult to fully understand the way they operate, limiting their assessment to empirical evaluation. Here we study the soundness of the standard practices by which CNNs are evaluated for the purpose of cancer pathology. Thirteen highly used cancer benchmark datasets were analyzed, using four common CNN architectures and different types of cancer, such as melanoma, carcinoma, colorectal cancer, and lung cancer. We compared the accuracy of each model with that of datasets made of cropped segments from the background of the original images that do not contain clinically relevant content. Because the rendered datasets contain no clinical information, the null hypothesis is that the CNNs should provide mere chance-based accuracy when classifying these datasets. The results show that the CNN models provided high accuracy when using the cropped segments, sometimes as high as 93%, even though they lacked biomedical information. These results show that some CNN architectures are more sensitive to bias than others. The analysis shows that the common practices of machine learning evaluation might lead to unreliable results when applied to cancer pathology. These biases are very difficult to identify, and might mislead researchers as they use available benchmark datasets to test the efficacy of CNN methods.
https://arxiv.org/abs/2603.12445
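The control experiment described above is straightforward to reproduce in spirit: crop clinically empty regions and train/evaluate on those alone. The sketch below uses corner patches as a crude background heuristic; the paper's exact cropping protocol may differ.

```python
import numpy as np

def background_crops(images, size=32):
    """Crop the four corner patches of each image as a clinically-empty control set.

    Corners are a simple background heuristic (lesions are usually central);
    this is an assumption of the sketch, not the paper's exact protocol.
    """
    crops = []
    for img in images:
        h, w = img.shape[:2]
        s = size
        crops += [img[:s, :s], img[:s, w - s:],
                  img[h - s:, :s], img[h - s:, w - s:]]
    return np.stack(crops)

rng = np.random.default_rng(0)
imgs = rng.random((5, 128, 128))   # placeholder grayscale "radiographs"
ctrl = background_crops(imgs)      # 4 corner crops per image, original labels kept
```

Under the null hypothesis, a classifier trained on `ctrl` with the original labels should score near chance; above-chance accuracy indicates a dataset- or acquisition-level bias.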
Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K{=}5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs (${\leq}7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in ${\approx}18$ GPU hours on a single RTX~4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.
https://arxiv.org/abs/2603.12091
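The sliding-window feedback memory above is essentially a bounded queue of diagnostic triples. A minimal sketch (field names and the toy loop are illustrative, not the authors' implementation):

```python
from collections import deque

class FeedbackMemory:
    """Sliding window of the K most recent (problem, modification, outcome) triples."""

    def __init__(self, k=5):
        self.window = deque(maxlen=k)   # old entries drop out automatically

    def record(self, problem, modification, outcome):
        # Failures are stored too: execution errors are first-class signals.
        self.window.append({"problem": problem,
                            "modification": modification,
                            "outcome": outcome})

    def as_prompt(self):
        """Render the window as constant-size context for the code-generator LLM."""
        return "\n".join(
            f"[{i}] problem: {e['problem']} | change: {e['modification']} | "
            f"outcome: {e['outcome']}"
            for i, e in enumerate(self.window))

mem = FeedbackMemory(k=5)
for step in range(8):   # more attempts than the window holds
    mem.record(f"low accuracy at step {step}", "widen conv channels",
               "RuntimeError" if step % 3 == 0 else "acc improved")
```

Because `maxlen=5`, the prompt size stays constant regardless of how many iterations the search runs, which is what keeps the LLM context bounded over 2000 iterations.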
Transcription factors (TFs) regulate gene expression through complex and cooperative mechanisms. While many TFs act together, the logic underlying TF binding and their interactions is not yet fully understood. Most current approaches for TF binding site prediction focus on individual TFs and binary classification tasks, without a full analysis of the possible interactions among various TFs. In this paper we investigate DNA TF binding site recognition as a multi-label classification problem, achieving reliable predictions for multiple TFs on DNA sequences retrieved from public repositories. Our deep learning models are based on Temporal Convolutional Networks (TCNs), which are able to predict multiple TF binding profiles, capturing correlations among TFs and their cooperative regulatory mechanisms. Our results suggest that multi-label learning achieves reliable predictive performance and can reveal biologically meaningful motifs and co-binding patterns consistent with known TF interactions, while also suggesting novel relationships and cooperation among TFs.
https://arxiv.org/abs/2603.12073
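The multi-label formulation above amounts to one independent sigmoid output per TF, trained with binary cross-entropy averaged over all (sequence, TF) pairs. A minimal numpy sketch with toy targets (the TCN itself is omitted; only the multi-label loss is shown):

```python
import numpy as np

def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over all (sequence, TF) pairs.

    Each output unit is an independent sigmoid, so one DNA window can be
    positive for several TFs at once -- the multi-label setting.
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

# 3 sequences x 4 TFs; a sequence may bind multiple TFs simultaneously.
targets = np.array([[1, 1, 0, 0],
                    [0, 0, 1, 0],
                    [1, 0, 1, 1]], dtype=float)
good = multilabel_bce(targets * 8 - 4, targets)      # confident, correct logits
bad  = multilabel_bce(-(targets * 8 - 4), targets)   # confident, wrong logits
```

This is the standard loss for the setting; the paper's contribution is the TCN architecture feeding it and the analysis of the learned co-binding patterns.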
The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate. We further provide a comparative analysis of all families across relevant dimensions -- linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks -- and outline the open challenges and future directions of this research area.
https://arxiv.org/abs/2603.12067
Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
https://arxiv.org/abs/2603.11971
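The bi-directional cross-attention fusion can be sketched without learned projections: each modality forms the queries once and the keys/values once, and the two attended streams are combined. This is a simplified, projection-free illustration of the idea, not the challenge submission's module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_seq, kv_seq):
    """Scaled dot-product attention: queries from one modality, keys/values from the other."""
    d = q_seq.shape[-1]
    scores = q_seq @ kv_seq.T / np.sqrt(d)
    return softmax(scores) @ kv_seq

def bidirectional_fusion(visual, audio):
    """Symmetric fusion: each modality attends to the other; residual streams are stacked.

    A projection-free sketch; the real module adds learned projections and a
    classification head.
    """
    v2a = cross_attention(visual, audio)   # visual queries, audio context
    a2v = cross_attention(audio, visual)   # audio queries, visual context
    return np.concatenate([visual + v2a, audio + a2v], axis=0)

rng = np.random.default_rng(0)
visual = rng.standard_normal((10, 64))   # 10 frame embeddings (e.g., from CLIP)
audio  = rng.standard_normal((10, 64))   # 10 audio embeddings (e.g., from Wav2Vec 2.0)
fused = bidirectional_fusion(visual, audio)
```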
Cancer is the unrestrained proliferation of malignant cells. In recent years, medical professionals have steadily gained enhanced diagnostic and treatment abilities by applying deep learning models to medical data for better clinical decisions, disease diagnosis, and drug discovery. A majority of cancers are studied and treated with these technologies. However, ovarian cancer remains a dilemma: its non-invasive detection procedures are inaccurate, while accurate detection requires a time-consuming, invasive procedure. Thus, in this research, several convolutional neural networks, namely LeNet-5, ResNet, VGGNet, and GoogLeNet/Inception, were used to develop 15 variants and select a model that accurately detects and identifies ovarian cancer. For effective model training, the OvarianCancer&SubtypesDatasetHistopathology dataset from Mendeley was used. After constructing the models, we utilized Explainable Artificial Intelligence (XAI) methods, namely LIME, Integrated Gradients, and SHAP, to explain the black-box outcome of the selected model. Performance was evaluated using accuracy, precision, recall, F1-score, ROC curves, and AUC. The evaluation showed that the slightly compact InceptionV3 model with ReLU had the best overall result, achieving an average score of 94% across all performance metrics on the augmented dataset. Lastly, the three aforementioned XAI methods were used for an overall comparative analysis. The aim of this research is that its contributions will help achieve a better detection method for ovarian cancer.
https://arxiv.org/abs/2603.11818
Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (\textit{Hierarchical and Explicit Label Modeling}), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
https://arxiv.org/abs/2603.11783
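The hierarchy-encoding step (ii) can be illustrated with a single graph-convolution layer over the label graph, H = ReLU(D^{-1/2}(A+I)D^{-1/2} X W). The toy hierarchy, embeddings, and dimensions below are placeholders, not HELM's configuration:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer with self-loops and symmetric normalization:
    H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy 4-node label hierarchy: root -> {vegetation, urban}, vegetation -> forest.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # initial label embeddings
W = rng.standard_normal((8, 8))
H = gcn_layer(A, X, W)             # hierarchy-aware label embeddings
```

Each label embedding now mixes in its parents' and children's embeddings, which is how the hierarchical structure becomes explicit in the representation.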
This article examines the application of Explainable Artificial Intelligence (XAI) in NLP based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.
https://arxiv.org/abs/2603.11778
Cotton harvesting is a critical phase in which cotton capsules are physically manipulated and fibre degradation can occur. To maintain the highest quality, harvesting methods must emulate delicate manual grasping to preserve cotton's intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include: the replacement of convolutional blocks with Squeeze-and-Excitation blocks, a redesigned backbone integrating attention mechanisms, and the replacement of standard upsampling operations with Content-Aware ReAssembly of FEatures (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise, and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET is comparable in size to small-to-medium YOLO models, using 7.6M parameters and 27.8 GFLOPS, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.
https://arxiv.org/abs/2603.11717
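SimAM is attractive in this setting because it is parameter-free. The sketch below follows the published SimAM formulation (each unit weighted by the sigmoid of its normalized squared deviation from the channel mean); how the module is wired into COTONET's neck is the paper's own design and is not reproduced here.

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention for a (C, H, W) feature map."""
    C, H, W = x.shape
    n = H * W - 1
    mu = x.mean(axis=(1, 2), keepdims=True)      # per-channel spatial mean
    d = (x - mu) ** 2                            # squared deviation per unit
    v = d.sum(axis=(1, 2), keepdims=True) / n    # per-channel variance estimate
    e_inv = d / (4.0 * (v + lam)) + 0.5          # inverse energy of each unit
    return x * (1.0 / (1.0 + np.exp(-e_inv)))    # sigmoid-gated reweighting

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))   # a toy feature map
out = simam(feat)
```

Units that deviate most from their channel mean receive the largest gates, which is the "distinctive neurons matter" intuition behind SimAM.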
The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on grain attributes (e.g., size and sphericity) control, thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to pave a key step toward accelerated, data-driven optimization and design of polycrystalline materials.
https://arxiv.org/abs/2603.11695
Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.
https://arxiv.org/abs/2603.11680
Millimeter-wave or terahertz communications can meet the demands of low-altitude economy networks for high-throughput sensing and real-time decision making. However, the high-frequency characteristics of wireless channels result in severe propagation loss and strong beam directivity, which make beam prediction challenging in highly mobile uncrewed aerial vehicle (UAV) scenarios. In this paper, we employ agentic AI to enable the transformation of mmWave base stations toward embodied intelligence. We innovatively design a multi-agent collaborative reasoning architecture for UAV-to-ground mmWave communications and propose a hybrid beam prediction model system based on bimodal data. The multi-agent architecture is designed to overcome the limited context window and weak controllability of large language model (LLM)-based reasoning by decomposing beam prediction into task analysis, solution planning, and completeness assessment. To align with the agentic reasoning process, a hybrid beam prediction model system is developed to process multimodal UAV data, including numeric mobility information and visual observations. The proposed hybrid model system integrates Mamba-based temporal modelling, convolutional visual encoding, and cross-attention-based multimodal fusion, and dynamically switches data-flow strategies under multi-agent guidance. Extensive simulations on a real UAV mmWave communication dataset demonstrate that the proposed architecture and system achieve high prediction accuracy and robustness under diverse data conditions, with maximum top-1 accuracy reaching 96.57%.
https://arxiv.org/abs/2603.11392
Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Modeling. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive-field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual details. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.
https://arxiv.org/abs/2603.11306