Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in using convolutional neural networks (CNNs) to identify the semantic elements commonly depicted on ancient coins. This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to this identification task, using fully automatic learning from multi-modal data (images and unstructured text). It summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coin analysis, and evaluates their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
https://arxiv.org/abs/2601.09433
We propose Spectral Complex Autoencoder Pruning (SCAP), a reconstruction-based criterion that measures functional redundancy at the level of individual output channels. For each convolutional layer, we construct a complex interaction field by pairing the full multi-channel input activation as the real part with a single output-channel activation (spatially aligned and broadcast across input channels) as the imaginary part. We transform this complex field to the frequency domain and train a low-capacity autoencoder to reconstruct normalized spectra. Channels whose spectra are reconstructed with high fidelity are interpreted as lying close to a low-dimensional manifold captured by the autoencoder and are therefore more compressible; conversely, channels with low fidelity are retained as they encode information that cannot be compactly represented by the learned manifold. This yields an importance score (optionally fused with the filter L1 norm) that supports simple threshold-based pruning and produces a structurally consistent pruned network. On VGG16 trained on CIFAR-10, at a fixed threshold of 0.6, we obtain 90.11% FLOP reduction and 96.30% parameter reduction with an absolute Top-1 accuracy drop of 1.67% from a 93.44% baseline after fine-tuning, demonstrating that spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy under aggressive compression.
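As an illustrative sketch (not the paper's implementation): the complex interaction field and spectrum could be built as below, with the trained autoencoder omitted; the channel-axis layout, normalization constant, and the exact fusion rule with the L1 norm are assumptions.

```python
import numpy as np

def complex_interaction_spectrum(x, y_c):
    """x: (C_in, H, W) input activation of a conv layer.
    y_c: (H, W) activation of one output channel, spatially aligned.
    Returns a normalized magnitude spectrum of the complex field."""
    z = x + 1j * y_c[None, :, :]              # imaginary part broadcast across input channels
    spec = np.abs(np.fft.fft2(z, axes=(-2, -1)))
    return spec / (spec.sum() + 1e-8)         # normalize so spectra are comparable across channels

def channel_importance(recon_error, l1_norm, alpha=0.5):
    """Hypothetical fusion with the filter L1 norm: channels the autoencoder
    reconstructs poorly (high error, i.e. low fidelity) are kept."""
    return alpha * recon_error + (1.0 - alpha) * l1_norm
```

Channels whose normalized spectra the low-capacity autoencoder reconstructs with low error would then fall below the pruning threshold.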
https://arxiv.org/abs/2601.09352
Vision Transformers (ViTs) have gained rapid adoption in computational pathology for their ability to model long-range dependencies through self-attention, addressing the limitations of convolutional neural networks that excel at local pattern capture but struggle with global contextual reasoning. Recent pathology-specific foundation models have further advanced performance by leveraging large-scale pretraining. However, standard ViTs remain inherently non-equivariant to transformations such as rotations and reflections, which are ubiquitous variations in histopathology imaging. To address this limitation, we propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture, imparting built-in rotational equivariance to learned representations. Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations. Our results on a public colorectal cancer dataset demonstrate that incorporating equivariant patch embedding enhances data efficiency and robustness, suggesting that equivariant transformers could potentially serve as more generalizable backbones for the application of ViT in histopathology, such as digital pathology foundation models.
https://arxiv.org/abs/2601.09130
Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Code and trained models will be made publicly available to facilitate reuse and reproducibility.
https://arxiv.org/abs/2601.08732
Detecting anomalies in high-dimensional, time-dependent simulation data is challenging due to complex spatial and temporal dynamics. We study reconstruction-based anomaly detection for ensemble data from parameterized Kármán vortex street simulations using convolutional autoencoders. We compare a 2D autoencoder operating on individual frames with a 3D autoencoder that processes short temporal stacks. The 2D model identifies localized spatial irregularities in single time steps, while the 3D model exploits spatio-temporal context to detect anomalous motion patterns and reduces redundant detections across time. We further evaluate volumetric time-dependent data and find that reconstruction errors are strongly influenced by the spatial distribution of mass, with highly concentrated regions yielding larger errors than dispersed configurations. Our results highlight the importance of temporal context for robust anomaly detection in dynamic simulations.
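The reconstruction-error scoring underlying both models can be sketched as follows (a minimal sketch; `reconstruct` stands in for a trained convolutional autoencoder, and `window=1` vs `window>1` mirrors the 2D/3D distinction):

```python
import numpy as np

def reconstruction_anomaly_scores(frames, reconstruct, window=1):
    """Score each temporal position by its mean squared reconstruction error.
    window=1 mimics the 2D (per-frame) model; window>1 mimics the 3D model
    operating on short temporal stacks."""
    scores = []
    for t in range(len(frames) - window + 1):
        stack = np.stack(frames[t:t + window])      # (window, H, W)
        err = np.mean((stack - reconstruct(stack)) ** 2)
        scores.append(err)
    return np.array(scores)
```

High scores flag frames (or stacks) that the autoencoder cannot reproduce, i.e. candidate anomalies.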
https://arxiv.org/abs/2601.08659
We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
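The NT-Xent objective used for the contrastive task can be sketched as follows (a minimal NumPy version over one batch of positive pairs; the temperature value is an assumption):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent over a batch of positive pairs (z1[i], z2[i]).
    z1, z2: (N, d) embeddings; returns the mean contrastive loss."""
    z = np.concatenate([z1, z2])                        # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine similarity via dot products
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                      # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # index of each sample's positive
    logits = sim - sim.max(axis=1, keepdims=True)       # stabilized log-softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this pulls each pair of denoised-signal views together while pushing apart all other samples in the batch.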
https://arxiv.org/abs/2601.08549
Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state of the art. However, learning good features is often computationally expensive and becomes especially difficult when data is scarce. One-shot learning is one such setting: predictions must be made after seeing only one example from a class, which requires special techniques. In this paper we explore different approaches to one-shot identification tasks in several domains, including an industrial application and face recognition. We use a stacked-image technique together with siamese capsule networks. It is encouraging to see that the capsule-based approach achieves strong results and exceeds other techniques on a wide range of datasets, from an industrial application to face recognition benchmarks, while being easy to use and optimise.
https://arxiv.org/abs/2601.08278
Hand gesture recognition is an important aspect of human-computer interaction and forms the basis of sign language for hearing-impaired people. This work proposes a novel hand gesture recognition system for differently-abled persons. The model uses a convolutional neural network, VGG-16, trained on a widely used image dataset with the Python and Keras libraries. The result is validated on the NUS dataset, consisting of 10 classes of hand gestures, which is fed to the model as the validation set. A testing dataset of 10 classes is then built using Google's open-source Application Programming Interface (API) to capture different human hand gestures, and efficacy is measured experimentally. The experimental results show that by combining a transfer learning mechanism with image data augmentation, the VGG-16 net produced around 98% accuracy.
https://arxiv.org/abs/2601.08262
Diabetic retinopathy (DR), affecting millions globally with projections indicating a significant rise, poses a severe blindness risk and strains healthcare systems. Diagnostic complexity arises from visual symptom overlap with conditions like age-related macular degeneration and hypertensive retinopathy, exacerbated by high misdiagnosis rates in underserved regions. This study introduces TIMM-ProRS, a novel deep learning framework integrating a Vision Transformer (ViT), a Convolutional Neural Network (CNN), and a Graph Neural Network (GNN) with multi-modal fusion. TIMM-ProRS uniquely leverages both retinal images and temporal biomarkers (HbA1c, retinal thickness) to capture multi-modal and temporal dynamics. Trained on APTOS 2019 and validated on Messidor-2, RFMiD, EyePACS, and Messidor-1, the model achieves 97.8% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance and outperforming existing methods such as RSG-Net and DeepDR. This approach enables early, precise, and interpretable diagnosis, supporting scalable telemedical management and enhancing global eye health sustainability.
https://arxiv.org/abs/2601.08240
Data quality plays a central role in the performance and robustness of convolutional neural networks (CNNs) for image classification. While high-quality data is often preferred for training, real-world inputs are frequently affected by noise and other distortions. This paper investigates the effect of deliberately introducing controlled noise into the training data to improve model robustness. Using the CIFAR-10 dataset, we evaluate the impact of three common corruptions, namely Gaussian noise, salt-and-pepper noise, and Gaussian blur, at varying intensities and training-set pollution levels. Experiments with a ResNet-18 model reveal that incorporating just 10% noisy data during training is sufficient to significantly reduce test loss and enhance accuracy under fully corrupted test conditions, with minimal impact on clean-data performance. These findings suggest that strategic exposure to noise can act as a simple yet effective regularizer, offering a practical trade-off between traditional data cleanliness and real-world resilience.
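The training-set pollution step can be sketched as follows (a minimal sketch covering only the Gaussian-noise case; pixel values are assumed to lie in [0, 1], and the noise level is an assumption):

```python
import numpy as np

def pollute_dataset(images, fraction=0.10, sigma=0.1, seed=0):
    """Add Gaussian noise to a random `fraction` of training images;
    the remaining images stay clean. Returns the polluted copy and
    the indices of the noisy images."""
    rng = np.random.default_rng(seed)
    images = images.copy()
    n_noisy = int(round(fraction * len(images)))
    idx = rng.choice(len(images), size=n_noisy, replace=False)
    noise = rng.normal(0.0, sigma, images[idx].shape)
    images[idx] = np.clip(images[idx] + noise, 0.0, 1.0)   # keep valid pixel range
    return images, idx
```

Training then proceeds on the mixed clean/noisy set exactly as on a clean one.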
https://arxiv.org/abs/2601.08043
The rapid deployment of drones poses significant challenges for airspace management, security, and surveillance. Current detection and classification technologies, including cameras, LiDAR, and conventional radar systems, often struggle to reliably identify and differentiate drones, especially those of similar models, under diverse environmental conditions and at extended ranges. Moreover, low radar cross sections and clutter further complicate accurate drone identification. To address these limitations, we propose a novel drone classification method based on artificial micro-Doppler signatures encoded by resonant electromagnetic stickers attached to drone blades. These tags generate distinctive, configuration-specific radar returns, enabling robust identification. We develop a tailored convolutional neural network (CNN) capable of processing raw radar signals, achieving high classification accuracy. Extensive experiments were conducted both in anechoic chambers with 43 tag configurations and outdoors under realistic flight trajectories and noise conditions. Dimensionality reduction techniques, including Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP), provided insight into code separability and robustness. Our results demonstrate reliable drone classification performance at signal-to-noise ratios as low as 7 dB, indicating the feasibility of long-range detection with advanced surveillance radar systems. Preliminary range estimations indicate potential operational distances of several kilometers, suitable for critical applications such as airport airspace monitoring. The integration of electromagnetic tagging with machine learning enables scalable and efficient drone identification, paving the way for enhanced aerial traffic management and security in increasingly congested airspaces.
https://arxiv.org/abs/2601.08042
Maize disease classification plays a vital role in mitigating yield losses and ensuring food security. However, the deployment of traditional disease detection models in resource-constrained environments, such as those using smartphones and drones, faces challenges due to high computational costs. To address these challenges, we propose LWMSCNN-SE, a lightweight convolutional neural network (CNN) that integrates multi-scale feature extraction, depthwise separable convolutions, and Squeeze-and-Excitation (SE) attention mechanisms. This novel combination enables the model to achieve 96.63% classification accuracy with only 241,348 parameters and 0.666 GFLOPs, making it suitable for real-time deployment in field applications. Our approach addresses the accuracy--efficiency trade-off by delivering high accuracy while maintaining low computational costs, demonstrating its potential for efficient maize disease diagnosis on edge devices in precision farming systems.
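The Squeeze-and-Excitation step at the heart of such designs can be sketched as follows (a minimal NumPy sketch with explicit weight matrices; the reduction ratio and weight shapes are assumptions, not the paper's configuration):

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """x: (C, H, W) feature map. w1: (C//r, C) and w2: (C, C//r) are the
    excitation weights for reduction ratio r.
    Squeeze: global average pool; Excitation: FC-ReLU-FC-sigmoid;
    then rescale each channel by its gate."""
    z = x.mean(axis=(1, 2))                   # squeeze -> (C,)
    h = np.maximum(w1 @ z, 0.0)               # reduce + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))       # expand + sigmoid gate in (0, 1)
    return x * s[:, None, None]               # channel-wise reweighting
```

The gate lets the network emphasize informative channels at negligible parameter cost, which is why SE blocks pair well with depthwise separable convolutions in lightweight models.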
https://arxiv.org/abs/2601.07957
Data-driven flow-field reconstruction typically relies on autoencoder architectures that compress high-dimensional states into low-dimensional latent representations. However, classical approaches such as variational autoencoders (VAEs) often struggle to preserve the higher-order statistical structure of fluid flows when subjected to strong compression. We propose DiffCoder, a coupled framework that integrates a probabilistic diffusion model with a conventional convolutional ResNet encoder and trains both components end-to-end. The encoder compresses the flow field into a latent representation, while the diffusion model learns a generative prior over reconstructions conditioned on the compressed state. This design allows DiffCoder to recover distributional and spectral properties that are not strictly required for minimizing pointwise reconstruction loss but are critical for faithfully representing statistical properties of the flow field. We evaluate DiffCoder and VAE baselines across multiple model sizes and compression ratios on a challenging dataset of Kolmogorov flow fields. Under aggressive compression, DiffCoder significantly improves the spectral accuracy while VAEs exhibit substantial degradation. Although both methods show comparable relative L2 reconstruction error, DiffCoder better preserves the underlying distributional structure of the flow. At moderate compression levels, sufficiently large VAEs remain competitive, suggesting that diffusion-based priors provide the greatest benefit when information bottlenecks are severe. These results demonstrate that the generative decoding by diffusion offers a promising path toward compact, statistically consistent representations of complex flow fields.
https://arxiv.org/abs/2601.07946
Fully convolutional networks have become the backbone of modern medical imaging due to their ability to learn multi-scale representations and perform end-to-end inference. Yet their potential for slice-to-volume reconstruction (SVR), the task of jointly estimating 3D anatomy and slice poses from misaligned 2D acquisitions, remains underexplored. We introduce a fast convolutional framework that fuses multiple orthogonal 2D slice stacks to recover coherent 3D structure and refines slice alignment through lightweight model-based optimization. Applied to fetal brain MRI, our approach reconstructs high-quality 3D volumes in under 10s, with 1s slice registration and accuracy on par with state-of-the-art iterative SVR pipelines, at a substantial speedup. The framework uses non-rigid displacement fields to represent transformations, generalizing to other SVR problems such as fetal body and placental MRI. Additionally, the fast inference time paves the way for real-time, scanner-side volumetric feedback during MRI acquisition.
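The displacement-field representation of a transformation can be illustrated with a toy warp (a sketch only: nearest-neighbour sampling and no field regularization, unlike a real SVR pipeline):

```python
import numpy as np

def warp_nearest(img, disp):
    """Warp a 2D slice by a per-pixel displacement field disp: (2, H, W),
    where disp[0] is the y-offset and disp[1] the x-offset in pixels.
    Out-of-bounds samples are clamped to the image border."""
    h, w = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + disp[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + disp[1]).astype(int), 0, w - 1)
    return img[src_y, src_x]
```

Because every pixel carries its own offset, the same representation covers rigid slice motion as a special case and non-rigid deformation in general.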
https://arxiv.org/abs/2601.07519
Glioblastoma (GBM) is a highly aggressive primary brain tumor with limited therapeutic options and poor prognosis. The methylation status of the O6-methylguanine-DNA methyltransferase (MGMT) gene promoter is a critical molecular biomarker that influences patient response to temozolomide chemotherapy. Traditional methods for determining MGMT status rely on invasive biopsies and are limited by intratumoral heterogeneity and procedural risks. This study presents a radiogenomic molecular imaging analysis framework for the non-invasive prediction of MGMT promoter methylation using multi-parametric magnetic resonance imaging (mpMRI). Our approach integrates radiomics, deep learning, and explainable artificial intelligence (XAI) to analyze MRI-derived imaging phenotypes and correlate them with molecular labels. Radiomic features are extracted from FLAIR, T1-weighted, T1-contrast-enhanced, and T2-weighted MRI sequences, while a 3D convolutional neural network learns deep representations from the same modalities. These complementary features are fused using both early fusion and attention-based strategies and classified to predict MGMT methylation status. To enhance clinical interpretability, we apply XAI methods such as Grad-CAM and SHAP to visualize and explain model decisions. The proposed framework is trained on the RSNA-MICCAI Radiogenomic Classification dataset and externally validated on the BraTS 2021 dataset. This work advances the field of molecular imaging by demonstrating the potential of AI-driven radiogenomics for precision oncology, supporting non-invasive, accurate, and interpretable prediction of clinically actionable molecular biomarkers in GBM.
https://arxiv.org/abs/2601.07035
Selective fixed-filter active noise control (SFANC) is a novel approach capable of mitigating noise with varying frequency characteristics. It offers faster response and greater computational efficiency compared to traditional adaptive algorithms. However, spatial factors, particularly the influence of the noise source location, are often overlooked. Some existing studies have explored the impact of the direction-of-arrival (DoA) of the noise source on ANC performance, but they are mostly limited to free-field conditions and do not consider the more complex indoor reverberant environments. To address this gap, this paper proposes a learning-based directional SFANC method that incorporates the DoA of the noise source in reverberant environments. In this framework, multiple reference signals are processed by a convolutional neural network (CNN) to estimate the azimuth and elevation angles of the noise source, as well as to identify the most appropriate control filter for effective noise cancellation. Compared to traditional adaptive algorithms, the proposed approach achieves superior noise reduction with shorter response times, even in the presence of reverberations.
https://arxiv.org/abs/2601.06981
The rapid growth of multimedia consumption, driven by major advances in mobile devices since the mid-2000s, has led to widespread use of video conferencing applications (VCAs) such as Zoom and Google Meet, as well as instant messaging applications (IMAs) like WhatsApp and Telegram, which increasingly support video conferencing as a core feature. Many of these systems rely on the Web Real-Time Communication (WebRTC) protocol, enabling direct peer-to-peer media streaming without requiring a third-party server to relay data, reducing latency and facilitating real-time communication. Despite WebRTC's potential, adverse network conditions can degrade streaming quality and consequently reduce users' Quality of Experience (QoE). Maintaining high QoE therefore requires continuous monitoring and timely intervention when QoE begins to deteriorate. While content providers can often estimate QoE by directly comparing transmitted and received media, this task is significantly more challenging for internet service providers (ISPs). End-to-end encryption, commonly used by modern VCAs and IMAs, prevents ISPs from accessing the original media stream, leaving only Quality of Service (QoS) and routing information available. To address this limitation, we propose the QoE Attention Convolutional Neural Network (qAttCNN), a model that leverages the packet-size parameter of the traffic to infer two no-reference QoE metrics, namely BRISQUE and frames per second (FPS). We evaluate qAttCNN on a custom dataset collected from WhatsApp video calls and compare it against existing QoE models. Using mean absolute error percentage (MAEP), our approach achieves 2.14% error for BRISQUE prediction and 7.39% for FPS prediction.
https://arxiv.org/abs/2601.06862
We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.
https://arxiv.org/abs/2601.06279
Transformers require positional encodings to represent sequence order, yet most prior work focuses on designing new positional encodings rather than examining how positional information is fused with token embeddings. In this paper, we study whether the fusion mechanism itself affects performance, particularly in long-sequence settings. We conduct a controlled empirical study comparing three canonical fusion strategies--element-wise addition, concatenation with projection, and scalar gated fusion--under identical Transformer architectures, data splits, and random seeds. Experiments on three text classification datasets spanning short (AG News), medium (IMDB), and long (ArXiv) sequences show that fusion choice has negligible impact on short texts but produces consistent gains on long documents. To verify that these gains are structural rather than stochastic, we perform paired-seed analysis and cross-dataset comparison across sequence-length regimes. Additional experiments on the ArXiv dataset indicate that the benefit of learnable fusion generalizes across multiple positional encoding families. Finally, we explore a lightweight convolutional gating mechanism that introduces local inductive bias at the fusion level, evaluated on long documents only. Our results indicate that positional-encoding fusion is a non-trivial design choice for long-sequence Transformers and should be treated as an explicit modeling decision rather than a fixed default.
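The three canonical fusion strategies compared above can be sketched as follows (minimal NumPy versions; the sigmoid parameterization of the scalar gate is an assumption):

```python
import numpy as np

def fuse_add(tok, pos):
    """Element-wise addition: the standard Transformer default."""
    return tok + pos

def fuse_concat_project(tok, pos, w):
    """Concatenation with projection; w: (d, 2d) maps back to model dimension."""
    return np.concatenate([tok, pos], axis=-1) @ w.T

def fuse_scalar_gate(tok, pos, g):
    """Scalar gated fusion; g is a learnable scalar, squashed to (0, 1)."""
    a = 1.0 / (1.0 + np.exp(-g))
    return a * tok + (1.0 - a) * pos
```

Addition has no extra parameters, concatenation-with-projection adds a full matrix, and the scalar gate adds a single parameter, which is one axis along which the fusion choice trades capacity against simplicity.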
https://arxiv.org/abs/2601.05807
Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information--a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image's luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
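The proposed metric can be approximated as follows (a sketch under stated substitutions: a single global SSIM window instead of the usual sliding window, and a box filter standing in for L0 gradient smoothing):

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """SSIM computed over the whole image as one window (a coarse
    stand-in for windowed SSIM)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def box_smooth(img, k=3):
    """Simple box filter as a placeholder for L0 gradient smoothing."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def shape_texture_score(luminance):
    """High similarity to the smoothed image suggests a shape-dominant
    (texture-poor) image; low similarity suggests texture dominance."""
    return global_ssim(luminance, box_smooth(luminance))
```

Averaged over a dataset, this score would indicate whether the shape-biased adaptation (e.g. changing the max-pooling dilation) is likely to help.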
https://arxiv.org/abs/2601.05599