Image quality assessment (IQA) represents a pivotal challenge in image-focused technologies, significantly influencing the advancement trajectory of image processing and computer vision. Recently, IQA has witnessed a notable surge in innovative research efforts, driven by the emergence of novel architectural paradigms and sophisticated computational techniques. This survey delivers an extensive analysis of contemporary IQA methodologies, organized according to their application scenarios, serving as a beneficial reference for both beginners and experienced researchers. We analyze the advantages and limitations of current approaches and suggest potential future research pathways. The survey encompasses both general and specific IQA methodologies, including conventional statistical measures, machine learning techniques, and cutting-edge deep learning models such as convolutional neural networks (CNNs) and Transformer models. The analysis within this survey highlights the necessity for distortion-specific IQA methods tailored to various application scenarios, emphasizing the significance of practicality, interpretability, and ease of implementation in future developments.
https://arxiv.org/abs/2502.08540
Handwritten Text Recognition (HTR) has become an essential field within pattern recognition and machine learning, with applications spanning historical document preservation to modern data entry and accessibility solutions. The complexity of HTR lies in the high variability of handwriting, which makes it challenging to develop robust recognition systems. This survey examines the evolution of HTR models, tracing their progression from early heuristic-based approaches to contemporary state-of-the-art neural models, which leverage deep learning techniques. The scope of the field has also expanded, with models initially capable of recognizing only word-level content progressing to recent end-to-end document-level approaches. Our paper categorizes existing work into two primary levels of recognition: (1) \emph{up to line-level}, encompassing word and line recognition, and (2) \emph{beyond line-level}, addressing paragraph- and document-level challenges. We provide a unified framework that examines research methodologies, recent advances in benchmarking, key datasets in the field, and a discussion of the results reported in the literature. Finally, we identify pressing research challenges and outline promising future directions, aiming to equip researchers and practitioners with a roadmap for advancing the field.
https://arxiv.org/abs/2502.08417
Deep learning can effectively learn high-level semantic features for PolSAR images in Euclidean space, but it requires converting the complex covariance matrix into a feature vector or a complex-valued vector as the network input. However, complex covariance matrices are essentially complex Hermitian positive definite (HPD) matrices that reside on a Riemannian manifold rather than in Euclidean space. The matrix's real and imaginary parts are equally significant, as the imaginary part carries the phase information. Vectorizing the matrix destroys the geometric structure and manifold characteristics of complex covariance matrices. To learn complex HPD matrices directly, we propose a Riemannian complex HPD convolution network (HPD\_CNN) for PolSAR images. This method consists of a complex HPD unfolding network (HPDnet) and a CV-3DCNN enhanced network. The proposed complex HPDnet defines HPD mapping, rectifying, and logEig layers to learn geometric features of complex matrices. In addition, a fast eigenvalue decomposition method is designed to reduce the computational burden. Finally, a Riemannian-to-Euclidean enhanced network is defined to enhance contextual information for classification. Experimental results on two real PolSAR datasets demonstrate that the proposed method achieves superior performance over state-of-the-art methods, especially in heterogeneous regions.
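A minimal sketch of the logEig idea mentioned in the abstract, not the paper's actual implementation: an HPD matrix is eigendecomposed, small eigenvalues are rectified, the log is applied to the eigenvalues, and the matrix is reassembled, moving it from the manifold into a space where Euclidean operations make sense. Function names and the toy covariance are illustrative assumptions.

```python
import numpy as np

def log_eig(hpd: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Matrix logarithm of a complex Hermitian positive definite matrix.

    Mirrors the role of a logEig layer: eigendecompose, rectify small
    eigenvalues (the 'rectifying' step), take the log, and reassemble.
    """
    eigvals, eigvecs = np.linalg.eigh(hpd)      # eigh handles complex Hermitian input
    eigvals = np.maximum(eigvals, eps)          # rectification guards against non-PD noise
    return (eigvecs * np.log(eigvals)) @ eigvecs.conj().T

# Toy 3x3 PolSAR-style covariance: C = A A^H + diagonal loading
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
cov = a @ a.conj().T + 0.1 * np.eye(3)
print(np.allclose(log_eig(cov), log_eig(cov).conj().T))  # output stays Hermitian: True
```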
https://arxiv.org/abs/2502.08137
While functional magnetic resonance imaging (fMRI) offers rich spatial resolution, it is limited by high operational costs and significant infrastructural demands. In contrast, electroencephalography (EEG) provides millisecond-level precision in capturing electrical activity but lacks the spatial resolution necessary for precise neural localization. To bridge these gaps, we introduce E2fNet, a simple yet effective deep learning model for synthesizing fMRI images from low-cost EEG data. E2fNet is specifically designed to capture and translate meaningful features from EEG across electrode channels into accurate fMRI representations. Extensive evaluations across three datasets demonstrate that E2fNet consistently outperforms existing methods, achieving state-of-the-art results in terms of the structural similarity index measure (SSIM). Our findings suggest that E2fNet is a promising, cost-effective solution for enhancing neuroimaging capabilities. The code is available at this https URL.
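Since SSIM is the headline metric above, here is a minimal sketch of how synthesized and ground-truth fMRI slices might be scored with it using scikit-image; the array shapes and variable names are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(predicted: np.ndarray, target: np.ndarray) -> float:
    """Average SSIM over a batch of 2D slices (shape: [N, H, W])."""
    scores = []
    for pred_slice, true_slice in zip(predicted, target):
        data_range = true_slice.max() - true_slice.min()
        scores.append(ssim(true_slice, pred_slice, data_range=data_range))
    return float(np.mean(scores))

# Toy arrays standing in for synthesized and real fMRI slices
rng = np.random.default_rng(0)
fake = rng.random((4, 64, 64))
real = rng.random((4, 64, 64))
print(mean_ssim(fake, real))
```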
https://arxiv.org/abs/2502.08025
Employing self-supervised learning (SSL) methodologies assumes paramount significance in handling unlabeled polyp datasets when building deep learning-based automatic polyp segmentation models. However, the intricate privacy dynamics surrounding medical data often preclude seamless data sharing among disparate medical centers. Federated learning (FL) emerges as a formidable solution to this privacy conundrum, yet within the realm of FL, optimizing model generalization stands as a pressing imperative. Robust generalization capabilities are imperative to ensure the model's efficacy across diverse geographical domains after training on localized client datasets. In this paper, a Federated self-supervised Domain Generalization method, named LFDG, is proposed to enhance the generalization capacity of federated, label-efficient intestinal polyp segmentation. Based on a classical SSL method, DropPos, LFDG proposes an adversarial learning-based data augmentation method (SSADA) to enhance data diversity. LFDG further proposes a relaxation module based on Source-reconstruction and Augmentation-masking (SRAM) to maintain stability in feature learning. We have validated LFDG on polyp images from six medical centers. Our method outperforms the baseline by 3.80% and other recent FL and SSL methods by 3.92%.
https://arxiv.org/abs/2502.07951
Hypercomplex image processing extends conventional techniques in a unified paradigm encompassing algebraic and geometric principles. This work leverages quaternions and the two-dimensional orthogonal planes split framework (splitting of a quaternion - representing a pixel - into pairs of orthogonal 2D planes) for natural/biomedical image analysis through the following computational workflows and outcomes: natural/biomedical image re-colorization, natural image de-colorization, natural/biomedical image contrast enhancement, computational re-staining and stain separation in histological images, and performance gains in machine/deep learning pipelines for histological images. The workflows are analyzed separately for natural and biomedical images to showcase the effectiveness of the proposed approaches. The proposed workflows can regulate color appearance (e.g. with alternative renditions and grayscale conversion) and image contrast, be part of automated image processing pipelines (e.g. isolating stain components, boosting learning models), and assist in digital pathology applications (e.g. enhancing biomarker visibility, enabling colorblind-friendly renditions). Employing only basic arithmetic and matrix operations, this work offers a computationally accessible methodology - in the hypercomplex domain - that showcases versatility and consistency across image processing tasks and a range of computer vision and biomedical applications. The proposed non-data-driven methods achieve comparable or better results (particularly in cases involving well-known methods) to those reported in the literature, showcasing the potential of robust theoretical frameworks with practical effectiveness. Results, methods, and limitations are detailed alongside discussion of promising extensions, emphasizing the potential of feature-rich mathematical/computational frameworks for natural and biomedical images.
https://arxiv.org/abs/2502.07758
The prevalence of noisy labels in real-world datasets poses a significant impediment to the effective deployment of deep learning models. While meta-learning strategies have emerged as a promising approach for addressing this challenge, existing methods often suffer from limited transferability and task-specific designs. This paper introduces TMLC-Net, a novel Transferable Meta-Learner for Correcting Noisy Labels, designed to overcome these limitations. TMLC-Net learns a general-purpose label correction strategy that can be readily applied across diverse datasets and model architectures without requiring extensive retraining or fine-tuning. Our approach integrates three core components: (1) Normalized Noise Perception, which captures and normalizes training dynamics to handle distribution shifts; (2) Time-Series Encoding, which models the temporal evolution of sample statistics using a recurrent neural network; and (3) Subclass Decoding, which predicts a corrected label distribution based on the learned representations. We conduct extensive experiments on benchmark datasets with various noise types and levels, demonstrating that TMLC-Net consistently outperforms state-of-the-art methods in terms of both accuracy and robustness to label noise. Furthermore, we analyze the transferability of TMLC-Net, showcasing its adaptability to new datasets and noise conditions, and establishing its potential as a broadly applicable solution for robust deep learning in noisy environments.
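A hedged PyTorch sketch of the three-component pipeline described above (normalized noise perception, recurrent encoding, subclass decoding), built from the abstract alone; the layer sizes and the exact per-sample statistics are assumptions.

```python
import torch
import torch.nn as nn

class LabelCorrector(nn.Module):
    """Toy analogue of a transferable label corrector: encode the temporal
    trajectory of per-sample training statistics with an LSTM, then decode a
    corrected label distribution."""

    def __init__(self, stat_dim: int = 4, hidden: int = 32, num_classes: int = 10):
        super().__init__()
        self.encoder = nn.LSTM(stat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, num_classes)

    def forward(self, stats: torch.Tensor) -> torch.Tensor:
        # stats: [batch, epochs, stat_dim], e.g. loss, margin, entropy per epoch
        normalized = (stats - stats.mean(dim=1, keepdim=True)) / (stats.std(dim=1, keepdim=True) + 1e-6)
        _, (h_n, _) = self.encoder(normalized)
        return torch.softmax(self.decoder(h_n[-1]), dim=-1)  # corrected label distribution

corrector = LabelCorrector()
trajectories = torch.randn(8, 20, 4)       # 8 samples, 20 epochs of 4 statistics each
print(corrector(trajectories).shape)       # torch.Size([8, 10])
```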
https://arxiv.org/abs/2502.07721
Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Current multimodal systems employ static resource provisioning and cannot easily adapt when compute resources change over time. Additionally, their reliance on processing sensor data with fixed feature extractors is ill-equipped to handle variations in modality quality. Consequently, uninformative modalities, such as those with high noise, needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges - it adjusts the total number of active layers across all modalities to meet compute resource constraints, and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
https://arxiv.org/abs/2502.07862
Prostate cancer is a leading health concern among men, requiring accurate and accessible methods for early detection and risk stratification. Prostate volume (PV) is a key parameter in multivariate risk stratification for early prostate cancer detection, commonly estimated using transrectal ultrasound (TRUS). While TRUS provides precise prostate volume measurements, its invasive nature often compromises patient comfort. Transabdominal ultrasound (TAUS) provides a non-invasive alternative but faces challenges such as lower image quality, complex interpretation, and reliance on operator expertise. This study introduces a new deep-learning-based framework for automatic PV estimation using TAUS, emphasizing its potential to enable accurate and non-invasive prostate cancer risk stratification. A dataset of TAUS videos from 100 individual patients was curated, with prostate boundaries manually delineated and diameters calculated by an expert clinician as ground truth. The introduced framework integrates deep-learning models for prostate segmentation in both the axial and sagittal planes, automatic prostate diameter estimation, and PV calculation. Segmentation performance was evaluated using the Dice similarity coefficient (%) and the Hausdorff distance (mm). The framework's volume estimation capability was evaluated in terms of volumetric error (mL). The framework demonstrates that it can estimate PV from TAUS videos with a mean volumetric error of -5.5 mL, corresponding to an average relative error between 5 and 15%. The introduced framework for automatic PV estimation from TAUS images, utilizing deep learning models for prostate segmentation, shows promising results. It effectively segments the prostate and estimates its volume, offering potential for reliable, non-invasive risk stratification for early prostate cancer detection.
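For context, prostate volume is conventionally estimated from three orthogonal diameters with the ellipsoid formula PV ≈ (π/6) × length × width × height; whether the framework above uses exactly this formula is an assumption, but it illustrates how automatic diameter estimates translate into a volume.

```python
import math

def ellipsoid_volume(length_mm: float, width_mm: float, height_mm: float) -> float:
    """Prostate volume in mL from three orthogonal diameters in mm (1 mL = 1000 mm^3)."""
    return math.pi / 6 * length_mm * width_mm * height_mm / 1000.0

# Example: diameters of 45 x 40 x 35 mm give roughly 33 mL
print(round(ellipsoid_volume(45, 40, 35), 1))
```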
https://arxiv.org/abs/2502.07859
Mass-produced optical lenses often exhibit defects that alter their scattering properties and compromise quality standards. Manual inspection is usually adopted to detect defects, but it suffers from low accuracy, high error rates, and limited scalability. To address these challenges, this study presents an automated defect detection system based on the YOLOv8 deep learning model. A custom dataset of optical lenses, annotated with defect and lens regions, was created to train the model. Experimental results obtained in this study reveal that the system can efficiently and accurately detect defects in optical lenses. The proposed system can be utilized in real-time industrial environments to enhance quality control processes by enabling reliable and scalable defect detection in optical lens manufacturing.
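A minimal sketch of how a YOLOv8 defect detector is typically trained and applied with the ultralytics package; the dataset YAML, weights file, and image path are hypothetical placeholders, not artifacts of this study.

```python
from ultralytics import YOLO

# Start from pretrained weights and fine-tune on a custom lens-defect dataset
model = YOLO("yolov8n.pt")
model.train(data="lens_defects.yaml", epochs=100, imgsz=640)  # YAML lists classes such as 'defect' and 'lens'

# Run inference on a production image and inspect detected boxes
results = model("lens_sample.jpg")
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)
```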
https://arxiv.org/abs/2502.07592
Global leaders and policymakers are unified in their unequivocal commitment to decarbonization efforts in support of Net-Zero agreements. District Heating Systems (DHS), while contributing to carbon emissions due to the continued reliance on fossil fuels for heat production, are embracing more sustainable practices, albeit with some sense of vulnerability, as this could constrain their ability to adapt to dynamic demand and production scenarios. As demographic demands grow and renewables become the central strategy in decarbonizing the heating sector, the need for accurate demand forecasting has intensified. Advances in digitization have paved the way for Machine Learning (ML) based solutions to become the industry standard for modeling complex time series patterns. In this paper, we focus on building a Deep Learning (DL) model that uses deconstructed components of the independent and dependent variables affecting heat demand as features to perform multi-step-ahead forecasting of heat demand. The model represents the input features in a time-frequency space and uses an attention mechanism to generate accurate forecasts. The proposed method is evaluated on a real-world dataset, and its forecasting performance is assessed against LSTM- and CNN-based forecasting models. Across different supply zones, the attention-based model outperforms the baselines quantitatively and qualitatively, with a Mean Absolute Error (MAE) of 0.105 with a standard deviation of 0.06 kWh and a Mean Absolute Percentage Error (MAPE) of 5.4% with a standard deviation of 2.8%, compared with the second-best model, which attains an MAE of 0.10 with a standard deviation of 0.06 kWh and a MAPE of 5.6% with a standard deviation of 3%.
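For reference, a small sketch of how the two reported error metrics are computed per supply zone; the array names and values are purely illustrative.

```python
import numpy as np

def mae(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Mean Absolute Error."""
    return float(np.mean(np.abs(actual - forecast)))

def mape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Mean Absolute Percentage Error, in percent."""
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# Toy hourly heat demand (kWh) for one supply zone
actual = np.array([1.2, 1.5, 1.1, 0.9, 1.4])
forecast = np.array([1.1, 1.6, 1.0, 1.0, 1.3])
print(mae(actual, forecast), round(mape(actual, forecast), 1))
```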
https://arxiv.org/abs/2502.07854
This study uses deep-learning models to predict city partition crime counts on specific days. It helps police enhance surveillance, gather intelligence, and proactively prevent crimes. We formulate crime count prediction as a spatiotemporal sequence challenge, where both the input data and the prediction targets are spatiotemporal sequences. In order to improve the accuracy of crime forecasting, we introduce a new model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. We conducted a comparative analysis to assess the effects of various data sequences, including raw and binned data, on the prediction errors of four deep learning forecasting models. Directly inputting raw crime data into the forecasting model causes high prediction errors, making the model unsuitable for real-world use. The findings indicate that the proposed CNN-LSTM model achieves optimal performance when crime data is categorized into 10 or 5 groups. Data binning can enhance forecasting model performance, but poorly defined intervals may reduce map granularity. Compared to dividing into 5 bins, binning into 10 intervals strikes an optimal balance, preserving data characteristics and surpassing raw data in predictive modelling efficacy.
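A short sketch of the kind of binning discussed above: daily partition-level crime counts discretized into 10 (or 5) ordinal groups before being fed to the forecaster. Equal-width bins are an assumption here; the study's actual interval definitions may differ.

```python
import pandas as pd

counts = pd.Series([0, 2, 5, 13, 7, 21, 3, 9, 16, 1])   # daily crime counts for one partition

# Discretize raw counts into 10 equal-width groups labelled 0..9
binned_10 = pd.cut(counts, bins=10, labels=False)
# Coarser alternative with 5 groups, as compared in the study
binned_5 = pd.cut(counts, bins=5, labels=False)

print(pd.DataFrame({"raw": counts, "bin10": binned_10, "bin5": binned_5}))
```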
https://arxiv.org/abs/2502.07465
We present our solution to the AAAI-25 VRD-IU challenge, achieving first place in the competition. Our approach integrates large margin loss for improved feature discrimination and employs heuristic rules to refine hierarchical relationships. By combining a deep learning-based matching strategy with greedy algorithms, we achieve a significant boost in accuracy while maintaining computational efficiency. Our method attains an accuracy of 0.98904 on the private leaderboard, demonstrating its effectiveness in document structure parsing. Source codes are publicly available at this https URL
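The abstract does not specify which large margin loss is used; the sketch below shows one common additive-margin formulation on normalized features, purely as an illustration of widening the decision margin between classes, not as the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def additive_margin_softmax_loss(features, weights, labels, margin=0.35, scale=30.0):
    """One common large-margin loss: cosine logits with a margin subtracted
    from the target class, then scaled and fed to cross-entropy."""
    cosine = F.normalize(features) @ F.normalize(weights).t()        # [batch, classes]
    target_mask = F.one_hot(labels, num_classes=weights.size(0)).float()
    logits = scale * (cosine - margin * target_mask)
    return F.cross_entropy(logits, labels)

features = torch.randn(16, 128)            # entity embeddings
weights = torch.randn(40, 128)             # one prototype per class
labels = torch.randint(0, 40, (16,))
print(additive_margin_softmax_loss(features, weights, labels))
```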
https://arxiv.org/abs/2502.07442
Cardiovascular diseases, a leading cause of noncommunicable disease-related deaths, require early and accurate detection to improve patient outcomes. Taking advantage of advances in machine learning and deep learning, multiple approaches have been proposed in the literature to address the challenge of detecting ECG anomalies. Typically, these methods rely on the manual interpretation of ECG signals, which is time consuming and depends on the expertise of healthcare professionals. The objective of this work is to propose a deep learning system, FADE, designed for normal ECG forecasting and anomaly detection, which reduces the need for extensive labeled datasets and manual interpretation. FADE has been trained in a self-supervised manner with a novel morphology-inspired loss function. Unlike conventional models that learn from labeled anomalous ECG waveforms, our approach predicts the future of normal ECG signals, thus avoiding the need for extensive labeled datasets. Using a novel distance function to compare forecasted ECG signals with actual sensor data, our method effectively identifies cardiac anomalies. Additionally, this approach can be adapted to new contexts through domain adaptation techniques. To evaluate our proposal, we performed a set of experiments using two publicly available datasets: MIT-BIH NSR and MIT-BIH Arrhythmia. The results demonstrate that our system achieves an average accuracy of 83.84% in anomaly detection, while correctly classifying normal ECG signals with an accuracy of 85.46%. Our proposed approach exhibited superior performance in the early detection of cardiac anomalies in ECG signals, surpassing previous methods that predominantly identify a limited range of anomalies. FADE effectively detects both abnormal heartbeats and arrhythmias, offering significant advantages in healthcare through cost reduction and the processing of large-scale ECG data.
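A hedged sketch of the forecast-and-compare idea: predict the upcoming window of a normal ECG, measure a distance between the forecast and the observed signal, and flag windows whose distance exceeds a threshold calibrated on normal data. The distance used here is plain Euclidean; the paper's morphology-inspired distance is not reproduced.

```python
import numpy as np

def detect_anomaly(forecast: np.ndarray, observed: np.ndarray, threshold: float) -> bool:
    """Flag a window as anomalous when the forecast-vs-observation distance is large."""
    distance = np.linalg.norm(forecast - observed) / np.sqrt(len(observed))
    return distance > threshold

# Calibrate the threshold on windows known to be normal (e.g. 95th percentile of distances)
rng = np.random.default_rng(0)
normal_distances = rng.normal(0.05, 0.01, size=500)
threshold = float(np.percentile(normal_distances, 95))

forecast = np.sin(np.linspace(0, 2 * np.pi, 200))
observed = forecast + rng.normal(0, 0.3, size=200)     # heavily distorted window
print(detect_anomaly(forecast, observed, threshold))    # likely True for this distortion
```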
https://arxiv.org/abs/2502.07389
Video microscopy, when combined with machine learning, offers a promising approach for studying the early development of in vitro produced (IVP) embryos. However, manually annotating developmental events, and more specifically cell divisions, is time-consuming for a biologist and cannot scale up for practical applications. We aim to automatically classify the cell stages of embryos from 2D time-lapse microscopy videos with a deep learning approach. We focus on the analysis of bovine embryonic development using video microscopy, as we are primarily interested in the application of cattle breeding, and we have created a Bovine Embryos Cell Stages (ECS) dataset. The challenges are three-fold: (1) low-quality images and bovine dark cells that make the identification of cell stages difficult, (2) class ambiguity at the boundaries of developmental stages, and (3) imbalanced data distribution. To address these challenges, we introduce CLEmbryo, a novel method that leverages supervised contrastive learning combined with focal loss for training, and the lightweight 3D neural network CSN-50 as an encoder. We also show that our method generalizes well. CLEmbryo outperforms state-of-the-art methods on both our Bovine ECS dataset and the publicly available NYU Mouse Embryos dataset.
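Focal loss, one of the two training ingredients named above, down-weights easy examples to counter the class imbalance across cell stages; a standard multi-class form is sketched below (the gamma value and class count are illustrative).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss: cross-entropy modulated by (1 - p_t)^gamma."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction="none")          # per-sample cross-entropy
    p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 5)                  # e.g. 5 cell-stage classes
targets = torch.randint(0, 5, (8,))
print(focal_loss(logits, targets))
```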
https://arxiv.org/abs/2502.07360
High-quality open-source datasets, which necessitate substantial curation effort, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised models pre-trained by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insight that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a $p$-value markedly below $0.05$, surpassing all previous methodologies. Our code is available at this https URL.
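A hedged sketch of the decision rule implied above: compare a statistic of instance relationships in the suspicious model's embedding space against the same statistic from models known not to have used the target dataset, and reject the null hypothesis when the p-value falls below 0.05. The similarity statistic and the two-sample test used here are illustrative stand-ins, not the paper's procedure.

```python
import numpy as np
from scipy import stats

def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Binary instance relationships: cosine similarity of every embedding pair."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    return sims[np.triu_indices(len(sims), k=1)]

rng = np.random.default_rng(0)
suspect = pairwise_cosine(rng.normal(1.0, 0.3, size=(64, 32)))   # suspicious backbone's embeddings
clean = pairwise_cosine(rng.normal(0.0, 1.0, size=(64, 32)))      # reference model's embeddings

# Two-sample test on the similarity distributions; p < 0.05 supports the ownership claim
t_stat, p_value = stats.ttest_ind(suspect, clean, equal_var=False)
print(p_value < 0.05)
```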
https://arxiv.org/abs/2502.07276
In this work, various analyses are conducted on frequency-dependent methods for sound event detection (SED) to further delve into their detailed characteristics and behaviors. While SED has been rapidly advancing through the adoption of various deep learning techniques from other pattern recognition fields, these techniques are often not well suited to SED. To address this issue, two frequency-dependent SED methods were previously proposed: FilterAugment, a data augmentation that randomly weights frequency bands, and frequency dynamic convolution (FDY Conv), an architecture applying frequency-adaptive convolution kernels. These methods have demonstrated superior performance in SED, and we aim to further analyze their detailed effectiveness and characteristics. We compare class-wise performance to identify the specific pros and cons of FilterAugment and FDY Conv. We apply Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights the time-frequency regions the model relies on most, to SED models with and without frequency masking and with two types of FilterAugment to observe their detailed characteristics. We propose simpler frequency-dependent convolution methods and compare them with FDY Conv to further understand which components of FDY Conv affect SED performance. Lastly, we apply PCA to show how FDY Conv adapts dynamic kernels across the frequency dimension for different sound event classes. The results and discussions demonstrate that frequency dependency plays a significant role in sound event detection and further confirm the effectiveness of frequency-dependent methods for SED.
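A minimal sketch of the FilterAugment idea referenced above: split the mel-frequency axis into a few random bands and scale each band by a random gain. The band count and gain range are illustrative and do not reproduce the original hyperparameters.

```python
import numpy as np

def filter_augment(spectrogram: np.ndarray, n_bands=(2, 5), db_range=(-6.0, 6.0)) -> np.ndarray:
    """Randomly weight frequency bands of a [freq, time] spectrogram (linear magnitude)."""
    rng = np.random.default_rng()
    n_freq = spectrogram.shape[0]
    n = rng.integers(n_bands[0], n_bands[1] + 1)
    edges = np.concatenate(([0], np.sort(rng.integers(1, n_freq, size=n - 1)), [n_freq]))
    out = spectrogram.copy()
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain_db = rng.uniform(*db_range)
        out[lo:hi] *= 10.0 ** (gain_db / 20.0)      # one random gain per frequency band
    return out

mel = np.abs(np.random.randn(128, 500))             # toy mel spectrogram: 128 bins x 500 frames
print(filter_augment(mel).shape)
```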
https://arxiv.org/abs/2502.07208
Reverberant speech, denoting the speech signal degraded by the process of reverberation, contains crucial knowledge of both anechoic source speech and room impulse response (RIR). This work proposes a variational Bayesian inference (VBI) framework with neural speech prior (VINP) for joint speech dereverberation and blind RIR identification. In VINP, a probabilistic signal model is constructed in the time-frequency (T-F) domain based on convolution transfer function (CTF) approximation. For the first time, we propose using an arbitrary discriminative dereverberation deep neural network (DNN) to predict the prior distribution of anechoic speech within a probabilistic model. By integrating both reverberant speech and the anechoic speech prior, VINP yields the maximum a posteriori (MAP) and maximum likelihood (ML) estimations of the anechoic speech spectrum and CTF filter, respectively. After simple transformations, the waveforms of anechoic speech and RIR are estimated. Moreover, VINP is effective for automatic speech recognition (ASR) systems, which sets it apart from most deep learning (DL)-based single-channel dereverberation approaches. Experiments on single-channel speech dereverberation demonstrate that VINP reaches an advanced level in most metrics related to human perception and displays unquestionable state-of-the-art (SOTA) performance in ASR-related metrics. For blind RIR identification, experiments indicate that VINP attains the SOTA level in blind estimation of reverberation time at 60 dB (RT60) and direct-to-reverberation ratio (DRR). Codes and audio samples are available online.
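For readers unfamiliar with the CTF approximation mentioned above: in the T-F domain, the reverberant spectrogram is modeled, per frequency bin, as a short convolution of the anechoic spectrogram with a CTF filter along the frame axis. A toy forward model is sketched below; the shapes and filter length are illustrative, not the paper's configuration.

```python
import numpy as np

def ctf_forward(anechoic_stft: np.ndarray, ctf: np.ndarray) -> np.ndarray:
    """Reverberant STFT under the CTF approximation.

    anechoic_stft: [frames, freqs] complex spectrogram of the dry speech
    ctf:           [taps, freqs]   convolutive transfer function per frequency bin
    Returns X with X[t, f] = sum_l ctf[l, f] * anechoic_stft[t - l, f].
    """
    frames, _ = anechoic_stft.shape
    reverberant = np.zeros_like(anechoic_stft)
    for l in range(ctf.shape[0]):
        reverberant[l:] += ctf[l] * anechoic_stft[: frames - l]
    return reverberant

rng = np.random.default_rng(0)
dry = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
ctf = 0.5 ** np.arange(8)[:, None] * np.ones((8, 257))    # decaying 8-tap filter per bin
print(ctf_forward(dry, ctf).shape)
```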
https://arxiv.org/abs/2502.07205
This research addresses the challenge of limited data in tabular data classification, particularly prevalent in domains with constraints like healthcare. We propose Tab2Visual, a novel approach that transforms heterogeneous tabular data into visual representations, enabling the application of powerful deep learning models. Tab2Visual effectively addresses data scarcity by incorporating novel image augmentation techniques and facilitating transfer learning. We extensively evaluate the proposed approach on diverse tabular datasets, comparing its performance against a wide range of machine learning algorithms, including classical methods, tree-based ensembles, and state-of-the-art deep learning models specifically designed for tabular data. We also perform an in-depth analysis of factors influencing Tab2Visual's performance. Our experimental results demonstrate that Tab2Visual outperforms other methods in classification problems with limited tabular data.
https://arxiv.org/abs/2502.07181
In recent years, deep learning with Convolutional Neural Networks (CNNs) has achieved remarkable results in the field of HMER (Handwritten Mathematical Expression Recognition). However, it remains challenging to improve performance with limited labeled training data. This paper presents, for the first time, a simple yet effective semi-supervised HMER framework by introducing dual-branch semi-supervised learning. Specifically, we simplify conventional deep co-training from consistency regularization to cross-supervised learning, where the prediction of one branch is used as a pseudo-label to directly supervise the other branch end-to-end. Considering that the learning of the two branches tends to converge in the later stages of model optimization, we also incorporate a weak-to-strong strategy by applying different levels of augmentation to each branch, which behaves like expanding the training data and improves the quality of network training. Meanwhile, we propose a novel module, the Global Dynamic Counting Module (GDCM), to enhance the performance of the HMER decoder, which alleviates recognition inaccuracies in long-distance formula recognition and the occurrence of repeated characters. We release our code at this https URL.
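A hedged, task-agnostic sketch of the cross-supervision described above: each branch's prediction on a weakly augmented input serves as the pseudo-label for the other branch's prediction on a strongly augmented input. The HMER-specific decoder and the GDCM module are not reproduced; this only illustrates the dual-branch loss.

```python
import torch
import torch.nn.functional as F

def cross_supervised_loss(logits_a_weak, logits_a_strong, logits_b_weak, logits_b_strong):
    """Each branch is supervised end-to-end by the other branch's pseudo-labels."""
    pseudo_from_a = logits_a_weak.detach().argmax(dim=-1)   # branch A labels branch B
    pseudo_from_b = logits_b_weak.detach().argmax(dim=-1)   # branch B labels branch A
    loss_b = F.cross_entropy(logits_b_strong, pseudo_from_a)
    loss_a = F.cross_entropy(logits_a_strong, pseudo_from_b)
    return loss_a + loss_b

# Toy symbol-classification stand-in: batch of 8, 50 symbol classes
logits = [torch.randn(8, 50) for _ in range(4)]
print(cross_supervised_loss(*logits))
```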
https://arxiv.org/abs/2502.07172