Traffic flow forecasting is a crucial task in transportation management and planning. The main challenges for traffic flow forecasting are that (1) as the prediction horizon increases, prediction accuracy decreases; and (2) the predicted results rely heavily on the extraction of temporal and spatial dependencies from road networks. To overcome these challenges, we propose a multi-channel spatial-temporal transformer model for traffic flow forecasting, which improves prediction accuracy by fusing results from different channels of traffic data. Our approach leverages a graph convolutional network to extract spatial features from each channel while using a transformer-based architecture to capture temporal dependencies across channels. We introduce an adaptive adjacency matrix to overcome the limitations of feature extraction from fixed topological structures. Experimental results on six real-world datasets demonstrate that introducing a multi-channel mechanism into the temporal model enhances performance, and that our proposed model outperforms state-of-the-art models in terms of accuracy.
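To make the adaptive adjacency idea concrete, here is a minimal PyTorch sketch of one common way to learn an adjacency matrix from node embeddings instead of the fixed road topology (in the style popularized by Graph WaveNet); the class name, embedding size, and node count are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAdjacency(nn.Module):
    """Learned adjacency matrix that is not tied to the fixed road topology.

    Each node gets a learnable source/target embedding; their inner product,
    passed through ReLU and a row-wise softmax, yields a normalized adjacency.
    """
    def __init__(self, num_nodes: int, embed_dim: int = 10):
        super().__init__()
        self.src = nn.Parameter(torch.randn(num_nodes, embed_dim))
        self.dst = nn.Parameter(torch.randn(num_nodes, embed_dim))

    def forward(self) -> torch.Tensor:
        logits = F.relu(self.src @ self.dst.t())  # (N, N), non-negative scores
        return F.softmax(logits, dim=1)           # row-normalized adjacency

# One graph-convolution step with the learned adjacency:
adj = AdaptiveAdjacency(num_nodes=207)            # e.g. 207 traffic sensors
x = torch.randn(32, 207, 64)                      # (batch, nodes, features)
out = torch.einsum("nm,bmf->bnf", adj(), x)       # propagate along learned edges
```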
https://arxiv.org/abs/2405.06266
Objective: Heart murmurs are abnormal sounds caused by turbulent blood flow within the heart. Several diagnostic methods are available to detect heart murmurs and their severity, such as cardiac auscultation, echocardiography, and phonocardiogram (PCG) analysis. However, these methods have limitations, including the extensive training and experience required of healthcare providers, the cost and accessibility of echocardiography, and noise interference and PCG data processing. This study aims to develop a novel end-to-end real-time heart murmur detection approach using traditional and depthwise separable convolutional networks. Methods: The continuous wavelet transform (CWT) was applied to extract meaningful features from the PCG data. The proposed network has three parts: the Squeeze net, the Bottleneck, and the Expansion net. The Squeeze net generates a compressed data representation, whereas the Bottleneck layer reduces computational complexity using a depthwise separable convolutional network. The Expansion net is responsible for up-sampling the compressed data to a higher dimension, capturing fine details of the representative data. Results: For evaluation, we used four publicly available datasets and achieved state-of-the-art performance on all of them. Furthermore, we tested our proposed network on two resource-constrained devices, a Raspberry Pi and an Android device, stripping it down into a tiny machine learning (TinyML) model and achieving a maximum of 99.70%. Conclusion: The proposed model offers a deep learning framework for real-time, accurate heart murmur detection within limited resources. Significance: It will result in significantly more accessible and practical medical services and reduced diagnosis time to assist medical professionals. The code is publicly available at TBA.
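The Bottleneck's efficiency comes from the depthwise separable factorization, which is standard and easy to sketch; channel widths and input shape below are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv.

    For k x k kernels this needs roughly C*k*k + C*C' weights instead of the
    C*C'*k*k of a standard convolution, which is what makes a bottleneck cheap
    enough for TinyML deployment.
    """
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Applied to a batch of hypothetical CWT scalogram feature maps:
x = torch.randn(8, 64, 128, 128)
y = DepthwiseSeparableConv(64, 128)(x)   # -> (8, 128, 128, 128)
```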
https://arxiv.org/abs/2405.09570
In practical fine-grained visual classification applications rooted in deep learning, a common scenario involves training a model on a pre-existing dataset. Subsequently, a new dataset becomes available, prompting a pivotal decision for achieving enhanced inference performance on both sides: should one train on the datasets from scratch, or fine-tune the model trained on the initial dataset using the newly released one? The existing literature reveals a lack of methods to systematically determine the optimal training strategy, let alone to do so in an explainable way. To this end, we present the Dual-Carriageway Framework (DCF), an automatic framework for searching the best-suited training solution, to fill this gap. DCF benefits from the design of a dual-direction search (starting from either the pre-existing or the newly released dataset) in which five different training settings are enforced. In addition, DCF is not only capable of identifying the optimal training strategy while avoiding overfitting, but also yields built-in quantitative and visual explanations derived from the actual inputs and weights of the trained model. We validated DCF's effectiveness through experiments with three convolutional neural networks (ResNet18, ResNet34 and Inception-v3) on two temporally continued commercial product datasets. Results showed fine-tuning pathways outperformed training-from-scratch ones by up to 2.13% and 1.23% on the pre-existing and new datasets, respectively, in terms of mean accuracy. Furthermore, DCF identified reflection padding as the superior padding method, enhancing testing accuracy by 3.72% on average. This framework stands out for its potential to guide the development of robust and explainable AI solutions in fine-grained visual classification tasks.
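A skeletal sketch of the dual-direction search loop may help. The abstract does not enumerate the five training settings, so the candidate strategies, and the `train`/`evaluate` callables (train(model, data) returns a trained model; evaluate(model, data) returns an accuracy), are illustrative stand-ins.

```python
def dual_carriageway_search(model_fn, old_data, new_data, train, evaluate):
    """Hedged sketch of a DCF-style search: train under several candidate
    settings starting from either dataset, then keep the strategy with the
    best mean accuracy across both datasets."""
    candidates = {
        "scratch_old_then_ft_new": lambda: train(train(model_fn(), old_data), new_data),
        "scratch_new_then_ft_old": lambda: train(train(model_fn(), new_data), old_data),
        "scratch_combined":        lambda: train(model_fn(), old_data + new_data),
        "scratch_new_only":        lambda: train(model_fn(), new_data),
    }
    scores = {}
    for name, build in candidates.items():
        model = build()
        scores[name] = (evaluate(model, old_data) + evaluate(model, new_data)) / 2
    return max(scores, key=scores.get), scores   # winning strategy + all scores
```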
https://arxiv.org/abs/2405.05853
The objective of single image dehazing is to restore hazy images and produce clear, high-quality visuals. Traditional convolutional models struggle with long-range dependencies due to their limited receptive field size. While Transformers excel at capturing such dependencies, their computational complexity, quadratic in the feature map resolution, makes them less suitable for pixel-to-pixel dense prediction tasks. Moreover, the fixed kernels or tokens in most models do not adapt well to varying blur sizes, resulting in suboptimal dehazing performance. In this study, we introduce a novel dehazing network based on Parallel Stripe Cross Attention (PCSA) with a multi-scale strategy. PCSA efficiently integrates long-range dependencies by simultaneously capturing horizontal and vertical relationships, allowing each pixel to gather contextual cues from an expanded spatial domain. To handle blurs of different sizes and shapes flexibly, we employ a channel-wise design with varying convolutional kernel sizes and strip lengths in each PCSA to capture context information at different scales. Additionally, we incorporate a softmax-based adaptive weighting mechanism within PCSA to prioritize and leverage the more critical features.
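As a rough illustration of stripe-restricted attention, the sketch below runs standard multi-head attention separately over rows (horizontal stripe) and columns (vertical stripe) and sums the two branches; this is a generic reconstruction of the idea, not the paper's PCSA implementation, and the fusion, kernel, and weighting details are omitted.

```python
import torch
import torch.nn as nn

class StripeAttention(nn.Module):
    """Self-attention restricted to one row (or column) of the feature map.

    Running a horizontal and a vertical branch in parallel lets every pixel
    attend over its whole row and column, far cheaper than full H*W x H*W
    spatial attention.
    """
    def __init__(self, dim: int, heads: int = 4, vertical: bool = False):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vertical = vertical

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        b, c, h, w = x.shape
        if self.vertical:                                  # sequences are columns
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:                                              # sequences are rows
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        if self.vertical:
            return out.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

x = torch.randn(2, 64, 32, 32)
fused = StripeAttention(64)(x) + StripeAttention(64, vertical=True)(x)
```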
https://arxiv.org/abs/2405.05811
In recent years, convolutional neural networks (CNNs) with channel-wise feature refining mechanisms have brought noticeable benefits to modelling channel dependencies. However, current attention paradigms fail to infer an optimal channel descriptor capable of simultaneously exploiting the statistical and spatial relationships among feature maps. In this paper, to overcome this shortcoming, we present a novel channel-wise spatially autocorrelated (CSA) attention mechanism. Inspired by geographical analysis, the proposed CSA exploits the spatial relationships between channels of feature maps to produce an effective channel descriptor. To the best of our knowledge, this is the first time that the concept of geographical spatial analysis has been utilized in deep CNNs. The proposed CSA imposes negligible learning parameters and light computational overhead on the deep model, making it a powerful yet efficient attention module of choice. We validate the effectiveness of the proposed CSA networks (CSA-Nets) through extensive experiments and analysis on the ImageNet and MS COCO benchmark datasets for image classification, object detection, and instance segmentation. The experimental results demonstrate that CSA-Nets consistently achieve competitive performance and superior generalization compared with several state-of-the-art attention-based CNNs across different benchmark tasks and datasets.
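To give a flavour of a spatial-autocorrelation-based channel descriptor, the sketch below computes, per channel, a Moran's-I-like correlation between each map and its four-neighbour average; this is a hedged reconstruction in the spirit of geographic autocorrelation measures, not the exact statistic CSA uses.

```python
import torch

def spatial_autocorrelation_descriptor(x: torch.Tensor) -> torch.Tensor:
    """Channel descriptor from spatial autocorrelation rather than plain
    global pooling.

    For each channel, correlate the mean-centred map with the average of its
    one-pixel neighbours (wraparound via roll, an illustrative simplification).
    x: (B, C, H, W) -> descriptor: (B, C)
    """
    z = x - x.mean(dim=(2, 3), keepdim=True)
    neighbours = (torch.roll(z, 1, dims=2) + torch.roll(z, -1, dims=2) +
                  torch.roll(z, 1, dims=3) + torch.roll(z, -1, dims=3)) / 4
    num = (z * neighbours).mean(dim=(2, 3))
    den = (z * z).mean(dim=(2, 3)) + 1e-8
    return num / den          # ~Moran's I per channel, in [-1, 1]
```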
https://arxiv.org/abs/2405.05755
Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, action recognition methods have also advanced greatly. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) two-stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that separately extracting motion information is no longer necessary; 3) Transformer-based methods, which bring models from natural language processing into computer vision and video understanding. We offer objective views in this review and hope to provide a reference for future research.
https://arxiv.org/abs/2405.05584
Our work tackles the fundamental challenge of image segmentation in computer vision, which is crucial for diverse applications. While supervised methods demonstrate proficiency, their reliance on extensive pixel-level annotations limits scalability. In response to this challenge, we present an enhanced unsupervised Convolutional Neural Network (CNN)-based algorithm called DynaSeg. Unlike traditional approaches that rely on a fixed weight factor to balance feature similarity and spatial continuity, requiring manual adjustment, our novel dynamic weighting scheme automates parameter tuning, adapting flexibly to image details. We also introduce the novel concept of a Silhouette Score Phase that addresses the challenge of dynamic clustering during iterations. Additionally, our methodology integrates both CNN-based and pre-trained ResNet feature extraction, offering a comprehensive and adaptable approach. We achieve state-of-the-art results on diverse datasets, with notable 12.2% and 14.12% mIoU improvements over the current benchmarks on COCO-All and COCO-Stuff, respectively. The proposed approach unlocks the potential of unsupervised image segmentation and addresses scalability concerns in real-world scenarios by obviating the need for meticulous parameter tuning.
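One plausible way to replace the fixed weight factor with a dynamic one is to set the weight from the current ratio of the two loss terms, so neither dominates as training progresses; the sketch below is an assumption-laden illustration, not DynaSeg's actual scheme.

```python
import torch

def dynamic_balance_loss(feat_sim: torch.Tensor, spat_cont: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of dynamically balancing the two unsupervised terms.

    feat_sim / spat_cont: scalar feature-similarity and spatial-continuity
    losses. The weight adapts each step from their ratio (detached so no
    gradient flows through the weight itself).
    """
    w = (feat_sim / (spat_cont + 1e-8)).detach()
    return feat_sim + w * spat_cont
```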
https://arxiv.org/abs/2405.05477
We present AFEN (Audio Feature Ensemble Learning), a model that leverages Convolutional Neural Networks (CNN) and XGBoost in an ensemble learning fashion to perform state-of-the-art audio classification for a range of respiratory diseases. We use a meticulously selected mix of audio features that captures the salient attributes of the data and allows for accurate classification. The extracted features are then used as input to two separate classifiers: 1) a multi-feature CNN classifier and 2) an XGBoost classifier. The outputs of the two models are then fused using soft voting. Thus, by exploiting ensemble learning, we achieve increased robustness and accuracy. We evaluate the performance of the model on a database of 920 respiratory sounds, which undergoes data augmentation to increase the diversity of the data and the generalizability of the model. We empirically verify that AFEN sets a new state of the art using Precision and Recall as metrics, while decreasing training time by 60%.
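Soft voting itself is simple to show: average (optionally weighted) the two classifiers' class-probability vectors and take the argmax. The mixing weight below is a hypothetical knob; the abstract does not state the ratio AFEN uses.

```python
import numpy as np

def soft_vote(cnn_probs: np.ndarray, xgb_probs: np.ndarray,
              w_cnn: float = 0.5) -> np.ndarray:
    """Fuse two classifiers by soft voting.

    cnn_probs / xgb_probs: (n_samples, n_classes) predicted probabilities.
    Returns the final class index per sample.
    """
    fused = w_cnn * cnn_probs + (1.0 - w_cnn) * xgb_probs
    return fused.argmax(axis=1)
```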
https://arxiv.org/abs/2405.05467
In this paper we describe ECG-SMART-NET for the identification of occlusion myocardial infarction (OMI). OMI is a severe form of heart attack characterized by complete blockage of one or more coronary arteries, requiring immediate referral for cardiac catheterization to restore blood flow to the heart. Two thirds of OMI cases are difficult to identify visually from a 12-lead electrocardiogram (ECG) and can be potentially fatal if not identified in a timely fashion. Previous works on this topic are scarce, and current state-of-the-art evidence suggests that both random forests with engineered features and convolutional neural networks (CNNs) are promising approaches to improve ECG detection of OMI. While the ResNet architecture has been successfully adapted for use with ECG recordings, it is not ideally suited to capture informative temporal features within each lead and the spatial concordance or discordance across leads. We propose a clinically informed modification of the ResNet-18 architecture. The model first learns temporal features through temporal convolutional layers with 1xk kernels, followed, after the residual blocks, by a spatial convolutional layer with 12x1 kernels to learn spatial features. The new ECG-SMART-NET was benchmarked against the original ResNet-18 and other state-of-the-art models on a multisite real-world clinical dataset consisting of 10,893 ECGs from 7,297 unique patients (OMI rate = 6.5%). ECG-SMART-NET outperformed the other models in the classification of OMI, with a test AUC of 0.889 +/- 0.027 and a test average precision of 0.587 +/- 0.087.
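The temporal-then-spatial kernel layout is concrete enough to sketch: a 12-lead ECG is treated as a (leads x time) grid, 1xk kernels slide along time within each lead, and a final 12x1 kernel mixes all leads at each time step. Channel widths, kernel size, and the omission of the residual blocks are illustrative simplifications, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class TemporalThenSpatial(nn.Module):
    """Minimal sketch of the clinically informed kernel layout described above."""
    def __init__(self, k: int = 7, width: int = 32):
        super().__init__()
        self.temporal = nn.Conv2d(1, width, kernel_size=(1, k), padding=(0, k // 2))
        self.spatial = nn.Conv2d(width, width, kernel_size=(12, 1))  # collapses leads

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 1, 12, T)
        return self.spatial(torch.relu(self.temporal(x)))  # -> (B, width, 1, T)

y = TemporalThenSpatial()(torch.randn(4, 1, 12, 5000))   # e.g. 10 s at 500 Hz
```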
https://arxiv.org/abs/2405.09567
Making AI safe and dependable requires the generation of dependable models and the dependable execution of those models. We propose redundant execution, a well-known technique, as a means to ensure reliable execution of an AI model. This generic technique extends the application scope of AI accelerators that do not feature well-documented safety or dependability properties. Typical redundancy techniques incur at least double or triple the computational expense of the original. We adopt a co-design approach, integrating reliable model execution with non-reliable execution, incurring the additional computational expense only where it is strictly necessary. We describe the design, implementation, and some preliminary results of a hybrid CNN.
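A toy sketch of selective redundancy: only blocks flagged as critical are executed twice and their outputs compared, keeping overhead well below the 2x of full duplication. This is a generic illustration of the co-design idea under my own assumptions, not the paper's implementation; a real system would re-execute or safe-stop on mismatch rather than raise.

```python
import torch

def selectively_redundant_forward(blocks, critical, x):
    """Run a list of layers, duplicating only the critical ones.

    blocks: iterable of callables (e.g. nn.Module layers);
    critical: set of indices whose execution is duplicated and checked.
    """
    for i, block in enumerate(blocks):
        y = block(x)
        if i in critical:
            y2 = block(x)                      # second, independent execution
            if not torch.allclose(y, y2):      # mismatch => transient fault
                raise RuntimeError(f"fault detected in block {i}")
        x = y
    return x
```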
https://arxiv.org/abs/2405.05146
Mitigating bias in machine learning models is a critical endeavor for ensuring fairness and equity. In this paper, we propose a novel approach to addressing bias by leveraging pixel image attributions to identify and regularize regions of images containing significant information about bias attributes. Our method uses a model-agnostic approach to extract pixel attributions, employing a convolutional neural network (CNN) classifier trained on small image patches. By training the classifier to predict a property of the entire image using only a single patch, we obtain region-based attributions that provide insight into the distribution of important information across the image. We propose utilizing these attributions to introduce targeted noise into datasets with confounding attributes that bias the data, thereby constraining neural networks from learning these biases and emphasizing the primary attributes. Our approach demonstrates its efficacy in enabling the training of unbiased classifiers on heavily biased datasets.
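The patch-based attribution step can be sketched as a sliding-window scan: score each patch by how confidently a patch-level classifier predicts the whole image's bias attribute from that patch alone. Patch size, stride, and the function name are hypothetical choices for illustration.

```python
import torch

@torch.no_grad()
def patch_attribution_map(patch_clf, image: torch.Tensor, bias_class: int,
                          patch: int = 32, stride: int = 16) -> torch.Tensor:
    """Coarse attribution grid from a patch classifier.

    image: (C, H, W). Returns a (rows, cols) grid of confidences for the
    bias attribute; high-scoring regions are where targeted noise would
    then be injected.
    """
    c, h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    scores = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            crop = image[:, i*stride:i*stride+patch, j*stride:j*stride+patch]
            probs = patch_clf(crop.unsqueeze(0)).softmax(dim=1)
            scores[i, j] = probs[0, bias_class]
    return scores
```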
https://arxiv.org/abs/2405.05031
Automatic medical image segmentation technology has the potential to expedite pathological diagnosis, thereby enhancing the efficiency of patient care. However, medical images often have complex textures and structures, and models often face the problems of reduced image resolution and information loss due to downsampling. To address this issue, we propose HC-Mamba, a new medical image segmentation model based on the modern state space model Mamba. Specifically, we introduce dilated convolution in the HC-Mamba model to capture a more extensive range of contextual information without increasing the computational cost, by extending the receptive field of the convolution kernel. In addition, the HC-Mamba model employs depthwise separable convolutions, significantly reducing the number of parameters and the computational cost of the model. By combining dilated convolution and depthwise separable convolutions, HC-Mamba is able to process large-scale medical image data at a much lower computational cost while maintaining a high level of performance. We conduct comprehensive experiments on segmentation tasks, including skin lesion segmentation, with extensive evaluation on ISIC17 and ISIC18 to demonstrate the potential of the HC-Mamba model in medical image segmentation. The experimental results show that HC-Mamba exhibits competitive performance on all these datasets, proving its effectiveness and usefulness in medical image segmentation.
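The combination of the two convolution tricks is easy to show in one block: dilation enlarges the 3x3 kernel's receptive field without extra weights, and the depthwise + pointwise split keeps the parameter count low. A minimal sketch, with dilation rate and channel handling as assumptions rather than HC-Mamba's exact block design:

```python
import torch
import torch.nn as nn

class DilatedDWConv(nn.Module):
    """Dilated depthwise-separable convolution block."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))   # spatial size preserved

y = DilatedDWConv(64)(torch.randn(1, 64, 256, 256))
```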
https://arxiv.org/abs/2405.05007
Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational cost. This means that Transformer-based networks can only use input information from a limited spatial range. Therefore, a novel Hybrid Multi-Axis Aggregation network (HMA) is proposed in this paper to better exploit the potential information in features. HMA is constructed by stacking Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB). On the one hand, RHTB combines channel attention and self-attention to enhance non-local feature fusion and produce more visually attractive results. On the other hand, GAB is used in cross-domain information interaction to jointly model similar features and obtain a larger perceptual field. For the super-resolution task, a novel pre-training method is further designed to enhance the model's representation capabilities, and the proposed model's effectiveness is validated through extensive experiments. The experimental results show that HMA outperforms the state-of-the-art methods on the benchmark dataset. We provide code and models at this https URL.
https://arxiv.org/abs/2405.05001
Accurately estimating a Health Index (HI) from condition monitoring (CM) data is essential for reliable and interpretable prognostics and health management (PHM) in complex systems. In most scenarios, complex systems operate under varying operating conditions and can exhibit different fault modes, making unsupervised inference of an HI from CM data a significant challenge. Hybrid models combining prior knowledge about degradation with deep learning have been proposed to overcome this challenge. However, previously suggested hybrid models for HI estimation usually rely heavily on system-specific information, limiting their transferability to other systems. In this work, we propose an unsupervised hybrid method for HI estimation that integrates general knowledge about degradation into the convolutional autoencoder's model architecture and learning algorithm, enhancing its applicability across various systems. The effectiveness of the proposed method is demonstrated in two case studies from different domains: turbofan engines and lithium batteries. The results show that the proposed method outperforms other competitive alternatives, including residual-based methods, in terms of HI quality and utility for Remaining Useful Life (RUL) prediction. The case studies also highlight the comparable performance of our proposed method with a supervised model trained with HI labels.
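One plausible instance of "general knowledge about degradation" is monotonicity: a health index inferred over a run-to-failure sequence should not increase. The penalty below is an illustrative prior that could be added to an autoencoder's loss, not necessarily the constraint used in the paper.

```python
import torch

def monotonicity_penalty(hi: torch.Tensor) -> torch.Tensor:
    """Soft monotonicity prior on an inferred health index.

    hi: (T,) health index over time. Penalizes positive increments,
    so the term is zero exactly when the HI is non-increasing.
    """
    increments = hi[1:] - hi[:-1]
    return torch.relu(increments).mean()
```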
https://arxiv.org/abs/2405.04990
Recent progress in remote sensing image (RSI) super-resolution (SR) has exhibited remarkable performance using deep neural networks, e.g., convolutional neural networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs on large-scale RSI. To alleviate these issues, we make the first attempt to integrate the Vision State Space Model (Mamba) into RSI-SR; Mamba specializes in processing large-scale RSI by capturing long-range dependencies with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR, to explore spatial and frequency correlations. In particular, our FMSR features a multi-level fusion architecture equipped with a Frequency Selection Module (FSM), a Vision State Space Module (VSSM), and a Hybrid Gate Module (HGM) to harness their respective merits for effective spatial-frequency fusion. Recognizing that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features for accurate feature fusion via learnable scaling adaptors. Extensive experiments on the AID, DOTA, and DIOR benchmarks demonstrate that our FMSR outperforms the state-of-the-art Transformer-based method HAT-L in terms of PSNR by 0.11 dB on average, while consuming only 28.05% of its memory and 19.08% of its complexity.
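To illustrate what a frequency-selection step could look like, the sketch below isolates high-frequency components of a feature map with an FFT and a radial mask, the kind of signal an FSM-like module could weight and fuse back with spatial features. The radial cutoff and hard mask are hypothetical simplifications, not FMSR's actual module.

```python
import torch

def high_frequency_component(x: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Keep only spectral components above a radial frequency cutoff.

    x: (B, C, H, W) feature maps; returns the high-frequency residual image.
    """
    freq = torch.fft.fft2(x)
    h, w = x.shape[-2:]
    fy = torch.fft.fftfreq(h).abs().view(-1, 1)       # per-row frequency
    fx = torch.fft.fftfreq(w).abs().view(1, -1)       # per-column frequency
    mask = ((fy**2 + fx**2).sqrt() > cutoff).to(x.dtype)
    return torch.fft.ifft2(freq * mask).real
```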
https://arxiv.org/abs/2405.04964
Facial feature tracking is essential in imaging ballistocardiography for accurate heart rate estimation, and skin feature tracking enables motor degradation quantification in Parkinson's disease. While deep convolutional neural networks have shown remarkable accuracy in tracking tasks, they typically require extensive labeled data for supervised training. Our proposed pipeline employs a convolutional stacked autoencoder to match image crops with a reference crop containing the target feature, learning deep feature encodings specific to the object category in an unsupervised manner and thus reducing data requirements. To overcome edge effects that make performance dependent on crop size, we introduce a Gaussian weight on the residual errors of the pixels when calculating the loss function. Training the autoencoder on facial images and validating its performance on manually labeled face and hand videos, our Deep Feature Encodings (DFE) method demonstrated superior tracking accuracy with a mean error ranging from 0.6 to 3.3 pixels, outperforming traditional methods such as SIFT, SURF, and Lucas-Kanade, as well as the latest transformers such as PIPs++ and CoTracker. Overall, our unsupervised learning approach excels at tracking various skin features under significant motion conditions, providing superior feature descriptors for tracking, matching, and image registration compared with both traditional and state-of-the-art supervised learning methods.
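The Gaussian-weighted residual loss can be sketched directly: pixels near the crop border get small weights, so the reconstruction score is dominated by the feature at the centre and becomes less sensitive to crop size. The width parameter `sigma_frac` is an illustrative assumption.

```python
import torch

def gaussian_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                          sigma_frac: float = 0.25) -> torch.Tensor:
    """MSE with a centred Gaussian weight over the crop.

    pred / target: (..., H, W) reconstructed and reference crops.
    """
    h, w = pred.shape[-2:]
    ys = torch.arange(h).float() - (h - 1) / 2
    xs = torch.arange(w).float() - (w - 1) / 2
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    sigma = sigma_frac * min(h, w)
    weight = torch.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return (weight * (pred - target) ** 2).mean()
```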
https://arxiv.org/abs/2405.04943
Satellite imagery has played an increasingly important role in post-disaster building damage assessment. Unfortunately, current methods still rely on manual visual interpretation, which is often time-consuming and can suffer from very low accuracy. To address the limitations of manual interpretation, there has been a significant increase in efforts to automate the process. We present a solution that performs the two most important tasks in building damage assessment, segmentation and classification, through deep learning models. We show our results submitted as part of the xView2 Challenge, a competition to design better models for identifying buildings and their damage level after exposure to multiple kinds of natural disasters. Our best model couples a building-identification semantic segmentation convolutional neural network (CNN) to a building-damage classification CNN, with a combined F1 score of 0.66, surpassing the xView2 challenge baseline F1 score of 0.28. We find that although our model was able to identify buildings with relatively high accuracy, building damage classification across various disaster types is a difficult task due to the visual similarity between different damage levels and the different damage distributions across disaster types, highlighting that a probabilistic prior estimate of disaster damage may be important for obtaining accurate predictions.
https://arxiv.org/abs/2405.04800
Few-Shot Class-Incremental Learning extends the class-incremental learning problem: a model faces data scarcity while also addressing catastrophic forgetting. This remains an open problem because recent works are built upon convolutional neural networks, which perform sub-optimally compared to transformer approaches. Our paper presents a Robust Transformer Approach (ROBUSTA) built upon the Compact Convolution Transformer. The issue of overfitting due to few samples is overcome with the notion of a stochastic classifier, where the classifier's weights are sampled from a distribution with mean and variance vectors, thus increasing the likelihood of correct classifications, together with a batch-norm layer to stabilize the training process. The issue of catastrophic forgetting is dealt with via delta parameters, small task-specific trainable parameters, while keeping the backbone network frozen. A non-parametric approach is developed to infer the delta parameters for the model's predictions. A prototype rectification approach is applied to avoid biased prototype calculations due to data scarcity. The advantage of ROBUSTA is demonstrated through a series of experiments on benchmark problems, where it is capable of outperforming prior art by large margins without any data augmentation protocols.
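The stochastic classifier is concrete enough to sketch: each class weight vector is sampled from a learned Gaussian via the reparameterization trick, so both the mean and the (log-)variance stay trainable; at test time the mean weights can be used deterministically. Initialization and the plain linear head are assumptions, not ROBUSTA's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticClassifier(nn.Module):
    """Classifier whose per-class weights are sampled from N(mu, sigma^2)."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.log_var = nn.Parameter(torch.zeros(num_classes, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, feat_dim)
        if self.training:
            eps = torch.randn_like(self.mu)                # reparameterization
            w = self.mu + eps * torch.exp(0.5 * self.log_var)
        else:
            w = self.mu                                    # deterministic at test
        return F.linear(x, w)                              # (B, num_classes)
```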
https://arxiv.org/abs/2405.05984
In recent years, convolutional neural networks (CNNs) have achieved remarkable advances in remote sensing image super-resolution. However, textures and structures in remote sensing images (RSIs) are complex and variable, often repeating within the same image while differing across others. Current deep learning-based super-resolution models focus less on high-frequency features, which leads to suboptimal performance in capturing contours, textures, and spatial information. State-of-the-art CNN-based methods now focus on the feature extraction of RSIs using attention mechanisms. However, these methods are still incapable of effectively identifying and utilizing key content attention signals in RSIs. To solve this problem, we propose an advanced feature extraction module called Channel and Spatial Attention Feature Extraction (CSA-FE) that extracts features effectively by incorporating channel and spatial attention into the standard vision transformer (ViT). The proposed method was trained on the UCMerced dataset at scales 2, 3, and 4. The experimental results show that our proposed method helps the model focus on the specific channels and spatial locations containing high-frequency information, so that the model attends to relevant features and suppresses irrelevant ones, enhancing the quality of the super-resolved images. Our model achieved superior performance compared to various existing models.
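A CBAM-style sketch conveys the channel-then-spatial gating idea: channel attention reweights feature maps from their global statistics, then spatial attention highlights the locations that matter. The reduction ratio, kernel size, and sequential ordering are generic assumptions, not necessarily CSA-FE's design.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel gate followed by a spatial gate over feature maps."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        x = x * self.channel(x)                            # channel gate
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)  # avg + max maps
        return x * self.spatial(pooled)                    # spatial gate
```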
https://arxiv.org/abs/2405.04595
All fields of knowledge are being impacted by Artificial Intelligence. In particular, the deep learning paradigm enables the development of data analysis tools that support subject matter experts in a variety of sectors, from physics to the recognition of ancient languages. Palaeontology is now observing this trend as well. This study explores the capability of Convolutional Neural Networks (CNNs), a class of deep learning algorithms specifically crafted for computer vision tasks, to classify images of isolated fossil shark teeth gathered from online datasets as well as from the authors' experience with Peruvian Miocene and Italian Pliocene fossil assemblages. The shark taxa included in the final composite dataset (which consists of more than one thousand images) are representative of both extinct and extant genera, namely Carcharhinus, Carcharias, Carcharocles, Chlamydoselachus, Cosmopolitodus, Galeocerdo, Hemipristis, Notorynchus, Prionace and Squatina. We developed a CNN, named SharkNet-X, specifically tailored to our recognition task, reaching a 5-fold cross-validated mean accuracy of 0.85 in identifying images containing a single shark tooth. Furthermore, we visualized the features extracted from images by the last dense layer of the CNN through the dimensionality reduction technique t-SNE. In addition, in order to understand and explain the behaviour of the CNN while giving a palaeontological point of view on the results, we introduced the explainability method SHAP. To the best of our knowledge, this is the first instance in which this method has been applied to the field of palaeontology. The main goal of this work is to showcase how deep learning techniques can aid in identifying isolated fossil shark teeth, paving the way for new information tools that automate the recognition and classification of fossils.
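The t-SNE feature visualization step is standard and easy to reproduce in a few lines with scikit-learn; the function name, perplexity, and colouring by genus label are illustrative choices, not the paper's exact plotting setup.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_dense_layer_embedding(features: np.ndarray, labels: np.ndarray) -> None:
    """Project last-dense-layer activations to 2-D with t-SNE.

    features: (n_images, n_units) activations extracted from the CNN;
    labels: (n_images,) integer genus ids used only for colouring.
    """
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.title("t-SNE of dense-layer features")
    plt.show()
```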
https://arxiv.org/abs/2405.04189