In this paper, we investigate a novel artificial intelligence generation task, termed generated contents enrichment (GCE). Unlike the conventional AI content generation task, which implicitly enriches the given textual description with limited semantics to produce visually real content, our proposed GCE performs content enrichment explicitly in both the visual and textual domains, so that the enriched contents are visually real, structurally reasonable, and semantically abundant. To solve GCE, we propose a deep end-to-end method that explicitly explores the semantics and inter-semantic relationships during enrichment. Specifically, we first model the input description as a semantic graph, wherein each node represents an object and each edge corresponds to an inter-object relationship. We then apply Graph Convolutional Networks on top of the input scene description to predict the enriching objects and their relationships with the input objects. Finally, the enriched graph is fed into an image synthesis model to carry out the visual content generation. Our experiments conducted on the Visual Genome dataset exhibit promising and visually plausible results.
https://arxiv.org/abs/2405.03650
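A minimal PyTorch sketch of the graph-convolution step described above: messages flow along the (subject, predicate, object) triples of the semantic graph and update the object and relationship embeddings. The layer shape, the joint fusion MLP, and all names are illustrative assumptions rather than the authors' exact architecture; predicting enriching objects and their edges would sit on top of these updated embeddings.

```python
import torch
import torch.nn as nn

class SceneGraphConv(nn.Module):
    """One message-passing layer over (subject, predicate, object) triples."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, 3 * dim)  # jointly updates s, p, o

    def forward(self, obj_feats, pred_feats, edges):
        # obj_feats: (N_obj, dim); pred_feats: (N_edge, dim)
        # edges: (N_edge, 2) long tensor of (subject_idx, object_idx)
        s, o = obj_feats[edges[:, 0]], obj_feats[edges[:, 1]]
        s_new, p_new, o_new = self.fuse(
            torch.cat([s, pred_feats, o], dim=-1)).chunk(3, dim=-1)
        # average the updated subject/object messages back onto the nodes
        out = torch.zeros_like(obj_feats)
        cnt = torch.zeros(obj_feats.size(0), 1)
        ones = torch.ones(edges.size(0), 1)
        out.index_add_(0, edges[:, 0], s_new)
        out.index_add_(0, edges[:, 1], o_new)
        cnt.index_add_(0, edges[:, 0], ones)
        cnt.index_add_(0, edges[:, 1], ones)
        return out / cnt.clamp(min=1), p_new
```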
Object detection algorithms, particularly those based on YOLO, have demonstrated remarkable efficiency in balancing speed and accuracy. However, their application to brain tumour detection remains underexplored. This study proposes RepVGG-GELAN, a novel YOLO architecture enhanced with RepVGG, a reparameterized convolutional approach, for object detection tasks, with a particular focus on brain tumour detection in medical images. RepVGG-GELAN leverages the RepVGG architecture to improve both speed and accuracy in detecting brain tumours. Integrating RepVGG into the YOLO framework aims to achieve a balance between computational efficiency and detection performance. This study also includes a spatial pyramid pooling-based Generalized Efficient Layer Aggregation Network (GELAN) architecture, which further enhances the capability of RepVGG. Experimental evaluation on a brain tumour dataset demonstrates the effectiveness of RepVGG-GELAN, which surpasses the existing RCS-YOLO in terms of precision and speed. Specifically, RepVGG-GELAN improves precision by 4.91% and AP50 by 2.54% over the latest existing approach while operating at 240.7 GFLOPs. The proposed RepVGG-GELAN with the GELAN architecture presents promising results, establishing itself as a state-of-the-art solution for accurate and efficient brain tumour detection in medical images. The implementation code is publicly available at this https URL.
https://arxiv.org/abs/2405.03541
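The RepVGG idea the paper builds on is structural reparameterization: train with parallel 3x3, 1x1, and identity branches, then fold all three into a single 3x3 convolution for fast inference. A simplified sketch with BatchNorm fusion omitted (names and the equal-channel block are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepVGGBlockLite(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):  # training-time multi-branch form
        return F.relu(self.conv3(x) + self.conv1(x) + x)

    @torch.no_grad()
    def fuse(self):        # inference-time single 3x3 kernel
        k = self.conv3.weight.clone()
        k[:, :, 1:2, 1:2] += self.conv1.weight      # 1x1 branch -> center tap
        for c in range(k.size(0)):
            k[c, c, 1, 1] += 1.0                    # identity branch
        fused = nn.Conv2d(k.size(1), k.size(0), 3, padding=1, bias=False)
        fused.weight.copy_(k)
        return fused

x = torch.randn(1, 8, 16, 16)
block = RepVGGBlockLite(8).eval()
assert torch.allclose(F.relu(block.fuse()(x)), block(x), atol=1e-5)
```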
Transparency and explainability in image classification are essential for establishing trust in machine learning models and detecting biases and errors. State-of-the-art explainability methods generate saliency maps to show where a specific class is identified, without providing a detailed explanation of the model's decision process. Striving to address such a need, we introduce a post-hoc method that explains the entire feature extraction process of a Convolutional Neural Network. These explanations include a layer-wise representation of the features the model extracts from the input. Such features are represented as saliency maps generated by clustering and merging similar feature maps, to which we associate a weight derived by generalizing Grad-CAM for the proposed methodology. To further enhance these explanations, we include a set of textual labels collected through a gamified crowdsourcing activity and processed using NLP techniques and Sentence-BERT. Finally, we show an approach to generate global explanations by aggregating labels across multiple images.
https://arxiv.org/abs/2405.03301
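A hedged sketch of the two core computations described above: Grad-CAM-style weights for an intermediate layer's feature maps, followed by clustering and merging of similar maps into candidate saliency maps. The network, layer choice, cluster count, and use of plain KMeans are assumptions; the paper generalizes Grad-CAM in its own, more elaborate way.

```python
import torch
from sklearn.cluster import KMeans
from torchvision.models import resnet18

model = resnet18(weights=None).eval()   # pretrained weights in practice
acts, grads = {}, {}
layer = model.layer3
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)
model(x)[0].max().backward()            # gradient of the top class score

A = acts["a"][0].detach()               # (C, H, W) feature maps
w = grads["g"][0].mean(dim=(1, 2))      # Grad-CAM weight per map
labels = KMeans(n_clusters=8, n_init=10).fit_predict(A.flatten(1).numpy())
merged = []                             # merged saliency maps + weights
for k in range(8):
    mask = torch.from_numpy(labels == k)
    merged.append((A[mask].mean(0), w[mask].mean()))
```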
The spectral graph convolutional network (SGCN) is a kind of graph neural network (GNN) based on graph signal filters and has shown compelling expressivity for modeling graph-structured data. Most SGCNs adopt polynomial filters and learn the coefficients from the training data. Many of them focus on which polynomial basis leads to optimal expressive power, while the models' architecture is little discussed. In this paper, we propose a general form of spectral graph convolution, in which the coefficients of the polynomial basis are stored in a third-order tensor. We then show that the convolution block in existing SGCNs can be derived by performing a certain coefficient decomposition operation on this coefficient tensor. Based on this generalized view, we develop the novel spectral graph convolutions CoDeSGC-CP and CoDeSGC-Tucker by applying CP and Tucker tensor decompositions to the coefficient tensor. Extensive experimental results demonstrate that the proposed convolutions achieve favorable performance improvements.
https://arxiv.org/abs/2405.03296
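A minimal sketch of the general form: a polynomial spectral filter whose coefficients live in a third-order tensor W of shape (K, d_in, d_out). Monomials of the normalized adjacency stand in for the polynomial basis here; CoDeSGC-CP and CoDeSGC-Tucker would additionally factor W via CP or Tucker decomposition.

```python
import torch
import torch.nn as nn

class TensorCoeffSGC(nn.Module):
    """y = sum_k A_hat^k X W[k], with coefficients in a 3rd-order tensor."""
    def __init__(self, d_in, d_out, K):
        super().__init__()
        self.W = nn.Parameter(torch.randn(K, d_in, d_out) * 0.01)

    def forward(self, X, A_hat):
        # X: (N, d_in) node features; A_hat: (N, N) normalized adjacency
        out, Xk = 0.0, X
        for k in range(self.W.size(0)):
            out = out + Xk @ self.W[k]   # k-th basis term, k-th coeff slice
            Xk = A_hat @ Xk              # next power of A_hat
        return out
```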
Brain disorders are a major challenge to global health, causing millions of deaths each year. Accurate diagnosis of these diseases relies heavily on advanced medical imaging techniques such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). However, the scarcity of annotated data poses a significant challenge to deploying machine learning models for medical diagnosis. To address this limitation, deep learning techniques have shown considerable promise. Domain adaptation techniques enhance a model's ability to generalize across imaging modalities by transferring knowledge from one domain (e.g., CT images) to another (e.g., MRI images). Such cross-modality adaptation is essential to improve the ability of models to generalize consistently across different imaging modalities. This study collected relevant resources from the Kaggle website and employed the Maximum Mean Discrepancy (MMD) method, a popular domain adaptation technique, to reduce the differences between imaging domains. By combining MMD with Convolutional Neural Networks (CNNs), the accuracy and utility of the model are clearly enhanced. The experimental results highlight the great potential of data-driven domain adaptation techniques to improve diagnostic accuracy and efficiency, especially in resource-limited environments. By bridging the gap between different imaging modalities, this study aims to provide clinicians with more reliable diagnostic tools.
https://arxiv.org/abs/2405.03235
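The MMD term reduces to a kernel two-sample statistic between feature batches drawn from the two imaging domains. A minimal sketch with a Gaussian kernel (this is the biased estimator, and the bandwidth is an assumption); in training, a weighted gaussian_mmd between CT-derived and MRI-derived CNN features would be added to the classification loss.

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Squared MMD between samples x: (n, d) and y: (m, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# e.g. total_loss = cross_entropy + lam * gaussian_mmd(feat_ct, feat_mri)
```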
We propose a novel, brain-inspired deep neural network model known as the Deep Oscillatory Neural Network (DONN). Deep neural networks such as Recurrent Neural Networks possess sequence-processing capabilities, but their internal states are not designed to exhibit brain-like oscillatory activity. With this motivation, the DONN is designed to have oscillatory internal dynamics. Neurons of the DONN are either nonlinear neural oscillators or traditional neurons with sigmoidal or ReLU activation. The neural oscillator used in the model is the Hopf oscillator, with its dynamics described in the complex domain. Input can be presented to the neural oscillator in three possible modes. The sigmoid and ReLU neurons also use complex-valued extensions, and all the weight stages are likewise complex-valued. Training follows the general principle of weight change by minimizing the output error and therefore bears an overall resemblance to complex backpropagation. A generalization of the DONN to convolutional networks, known as the Oscillatory Convolutional Neural Network, is also proposed. The two proposed oscillatory networks are applied to a variety of benchmark problems in signal and image/video processing. The performance of the proposed models is either comparable or superior to published results on the same datasets.
https://arxiv.org/abs/2405.03725
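The Hopf oscillator named above has a standard complex-domain form, dz/dt = z(mu + i*omega - |z|^2) + I(t), whose limit cycle has radius sqrt(mu). A minimal Euler-integration sketch (step size and parameters are illustrative, not the DONN's training setup):

```python
import torch

def hopf_step(z, inp, mu=1.0, omega=6.28, dt=1e-2):
    """One Euler step of a complex-valued Hopf oscillator driven by inp."""
    lam = torch.tensor(mu + 1j * omega, dtype=torch.cfloat)
    return z + dt * (z * (lam - z.abs() ** 2) + inp)

z = torch.full((4,), 0.1 + 0.0j, dtype=torch.cfloat)
for _ in range(5000):
    z = hopf_step(z, torch.zeros(4, dtype=torch.cfloat))
print(z.abs())   # with no input, |z| settles near sqrt(mu) = 1
```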
Complementary RGB and TIR modalities enable RGB-T tracking to achieve competitive performance in challenging scenarios. Therefore, how to better fuse cross-modal features is the core issue of RGB-T tracking. Some previous methods either insufficiently fuse RGB and TIR features, or depend on intermediaries containing information from both modalities to achieve cross-modal information interaction. The former does not fully exploit the potential of using only the RGB and TIR information of the template or search region for channel and spatial feature fusion, and the latter lacks direct interaction between the template and search area, which limits the model's ability to fully exploit the original semantic information of both modalities. To alleviate these limitations, we explore how to improve the performance of a visual Transformer through direct fusion of cross-modal channel and spatial features, and propose CSTNet. CSTNet uses ViT as a backbone and inserts cross-modal channel feature fusion modules (CFM) and cross-modal spatial feature fusion modules (SFM) for direct interaction between RGB and TIR features. The CFM performs parallel joint channel enhancement and joint multilevel spatial feature modeling of RGB and TIR features, sums the features, and then globally integrates the summed feature with the original features. The SFM uses cross-attention to model the spatial relationship of cross-modal features and then introduces a convolutional feedforward network for joint spatial and channel integration of multimodal features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks. Code is available at this https URL.
https://arxiv.org/abs/2405.03177
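A hedged sketch of the SFM's central idea: cross-attention in both directions between RGB and TIR token sequences, followed by feedforward fusion. A plain linear FFN stands in for the paper's convolutional feedforward network, and all dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalSpatialFusion(nn.Module):
    """RGB tokens attend to TIR tokens and vice versa, then a small FFN."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.rgb2tir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tir2rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, rgb, tir):              # both: (B, N, dim) tokens
        r, _ = self.rgb2tir(rgb, tir, tir)    # query=rgb, key/value=tir
        t, _ = self.tir2rgb(tir, rgb, rgb)
        return self.ffn(torch.cat([rgb + r, tir + t], dim=-1))
```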
Accurate 3D human pose estimation is a challenging task due to occlusion and depth ambiguity. In this paper, we introduce a multi-hop graph transformer network designed for 2D-to-3D human pose estimation in videos by leveraging the strengths of multi-head self-attention and multi-hop graph convolutional networks with disentangled neighborhoods to capture spatio-temporal dependencies and handle long-range interactions. The proposed network architecture consists of a graph attention block composed of stacked layers of multi-head self-attention and graph convolution with learnable adjacency matrix, and a multi-hop graph convolutional block comprised of multi-hop convolutional and dilated convolutional layers. The combination of multi-head self-attention and multi-hop graph convolutional layers enables the model to capture both local and global dependencies, while the integration of dilated convolutional layers enhances the model's ability to handle spatial details required for accurate localization of the human body joints. Extensive experiments demonstrate the effectiveness and generalization ability of our model, achieving competitive performance on benchmark datasets.
https://arxiv.org/abs/2405.03055
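A minimal sketch of a multi-hop graph convolution over body-joint features with a learnable adjacency. For brevity it uses cumulative powers of a single adjacency rather than the paper's disentangled (mutually exclusive) k-hop neighborhoods, and it omits the self-attention and dilated convolutional parts.

```python
import torch
import torch.nn as nn

class MultiHopGCN(nn.Module):
    """Sums hop-specific transforms of powers of a learnable adjacency."""
    def __init__(self, n_joints, d_in, d_out, hops=3):
        super().__init__()
        self.A = nn.Parameter(torch.eye(n_joints))  # learnable adjacency
        self.W = nn.ParameterList(
            [nn.Parameter(torch.randn(d_in, d_out) * 0.01)
             for _ in range(hops)])

    def forward(self, X):                           # X: (B, n_joints, d_in)
        A = torch.softmax(self.A, dim=-1)           # row-normalize
        out, Ak = 0.0, torch.eye(self.A.size(0))
        for W in self.W:
            out = out + Ak @ X @ W                  # hop-k neighborhood
            Ak = A @ Ak                             # next hop
        return out
```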
Colorectal cancer contributes significantly to cancer-related mortality. Timely identification and removal of polyps through colonoscopy screening is crucial for decreasing mortality rates. Accurately detecting polyps in colonoscopy images is difficult because of differences in characteristics such as size, shape, texture, and similarity to surrounding tissues. Current deep-learning methods often face difficulties in capturing the long-range connections necessary for segmentation. This research presents BetterNet, a convolutional neural network (CNN) architecture that combines residual learning and attention methods to enhance the accuracy of polyp segmentation. Its primary characteristics encompass (1) a residual decoder architecture that facilitates efficient gradient propagation and integration of multiscale features; (2) channel and spatial attention blocks within the decoder block that concentrate the learning process on the relevant areas of polyp regions; (3) state-of-the-art performance on polyp segmentation benchmarks while still ensuring computational efficiency; (4) thorough ablation tests confirming the influence of the architectural components; and (5) openly available model code for further contribution. Extensive evaluations conducted on datasets such as Kvasir-SEG, CVC-ClinicDB, Endoscene, EndoTect, and Kvasir-Sessile demonstrate that BetterNet outperforms current SOTA models in segmentation accuracy by significant margins. The lightweight design enables real-time inference for various applications. BetterNet shows promise for integration into computer-assisted diagnosis techniques to enhance the detection of polyps and the early recognition of cancer. Link to the code: this https URL
https://arxiv.org/abs/2405.04288
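Characteristic (2) above, channel and spatial attention inside a decoder block, can be sketched in a CBAM-like form; the reduction ratio and 7x7 spatial kernel are common defaults assumed here, not necessarily BetterNet's.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel attention (squeeze-excite style) then spatial attention."""
    def __init__(self, c, r=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                          # x: (B, C, H, W)
        ca = torch.sigmoid(self.mlp(x.mean((2, 3))))[..., None, None]
        x = x * ca                                 # reweight channels
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))  # reweight locations
```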
Skin lesion segmentation is a critical task in computer-aided diagnosis systems for dermatological diseases. Accurate segmentation of skin lesions from medical images is essential for early detection, diagnosis, and treatment planning. In this paper, we propose a new model for skin lesion segmentation, namely AC-MambaSeg, an enhanced model that has a hybrid CNN-Mamba backbone and integrates advanced components such as the Convolutional Block Attention Module (CBAM), Attention Gate, and Selective Kernel Bottleneck. AC-MambaSeg leverages the Vision Mamba framework for efficient feature extraction, while CBAM and the Selective Kernel Bottleneck enhance its ability to focus on informative regions and suppress background noise. We evaluate the performance of AC-MambaSeg on diverse skin lesion image datasets, including ISIC-2018 and PH2, and then compare it against existing segmentation methods. Our model shows promising potential for improving computer-aided diagnosis systems and facilitating the early detection and treatment of dermatological diseases. Our source code will be made available at: this https URL.
https://arxiv.org/abs/2405.03011
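Among the components named above, the Attention Gate has a compact canonical form in the Attention U-Net literature: the decoder's gating signal produces an additive-attention mask that reweights the encoder skip feature. A minimal sketch, assuming the gate has already been resized to the skip feature's spatial size:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate for suppressing irrelevant skip regions."""
    def __init__(self, c_skip, c_gate, c_mid):
        super().__init__()
        self.w_x = nn.Conv2d(c_skip, c_mid, 1)
        self.w_g = nn.Conv2d(c_gate, c_mid, 1)
        self.psi = nn.Conv2d(c_mid, 1, 1)

    def forward(self, x, g):   # x: skip feature, g: gating signal (same HxW)
        a = torch.sigmoid(self.psi(torch.relu(self.w_x(x) + self.w_g(g))))
        return x * a           # attention-weighted skip connection
```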
Efficient Image Super-Resolution (SR) aims to accelerate SR network inference by minimizing computational complexity and network parameters while preserving performance. Existing state-of-the-art efficient image super-resolution methods are based on convolutional neural networks. Few attempts have been made with Mamba to harness its long-range modeling capability and efficient computational complexity, which have shown impressive performance on high-level vision tasks. In this paper, we propose DVMSR, a novel lightweight image SR network that incorporates Vision Mamba and a distillation strategy. The network consists of three modules: a feature extraction convolution, multiple stacked Residual State Space Blocks (RSSBs), and a reconstruction module. Specifically, the deep feature extraction module is composed of several residual state space blocks, each of which contains several Vision Mamba Modules (ViMM) together with a residual connection. To achieve efficiency improvement while maintaining comparable performance, we apply a distillation strategy to the Vision Mamba network, leveraging the rich representation knowledge of a teacher network as additional supervision for the output of the lightweight student network. Extensive experiments have demonstrated that our proposed DVMSR can outperform state-of-the-art efficient SR methods in terms of model parameters while maintaining comparable PSNR and SSIM performance. The source code is available at this https URL
https://arxiv.org/abs/2405.03008
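The distillation strategy amounts to treating the frozen teacher's super-resolved output as extra supervision for the lightweight student. A minimal sketch; the L1 losses and the weighting are assumptions, not DVMSR's exact objective.

```python
import torch.nn.functional as F

def sr_distill_loss(student_sr, teacher_sr, hr, alpha=0.5):
    rec = F.l1_loss(student_sr, hr)                  # supervised reconstruction
    kd = F.l1_loss(student_sr, teacher_sr.detach())  # teacher as extra target
    return rec + alpha * kd
```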
Salient object detection (SOD) remains an important task in computer vision, with applications ranging from image segmentation to autonomous driving. Fully convolutional network (FCN)-based methods have made remarkable progress in visual saliency detection over the last few decades. However, these methods have limitations in accurately detecting salient objects, particularly in challenging scenes with multiple objects, small objects, or objects with low resolution. To address this issue, we propose a Saliency Fusion Attention U-Net (SalFAU-Net) model, which incorporates a saliency fusion module into each decoder block of the Attention U-Net model to generate a saliency probability map from each decoder block. SalFAU-Net employs an attention mechanism to selectively focus on the most informative regions of an image and suppress non-salient regions. We train SalFAU-Net on the DUTS dataset using a binary cross-entropy loss function. We conducted experiments on six popular SOD evaluation datasets to evaluate the effectiveness of the proposed method. The experimental results demonstrate that our method, SalFAU-Net, achieves competitive performance compared to other methods in terms of mean absolute error (MAE), F-measure, S-measure, and E-measure.
https://arxiv.org/abs/2405.02906
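A minimal sketch of the saliency fusion idea: a 1x1 head on each decoder stage produces a side saliency map, and the upsampled side maps are merged into one prediction. Averaging as the fusion operator is an assumption; each side map could also receive its own binary cross-entropy supervision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyFusion(nn.Module):
    """1x1 heads on each decoder stage; upsampled side maps are averaged."""
    def __init__(self, channels):                   # e.g. [512, 256, 128, 64]
        super().__init__()
        self.heads = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in channels])

    def forward(self, feats, out_hw):
        side = [F.interpolate(h(f), size=out_hw, mode="bilinear",
                              align_corners=False)
                for h, f in zip(self.heads, feats)]
        fused = torch.sigmoid(torch.stack(side).mean(0))
        return fused, side                           # fused map + side maps
```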
Brain tumor segmentation is a fundamental step in assessing a patient's cancer progression. However, manual segmentation demands significant expert time to identify tumors in 3D multimodal brain MRI scans accurately. This reliance on manual segmentation makes the process prone to intra- and inter-observer variability. This work proposes a brain tumor segmentation method as part of the BraTS-GoAT challenge. The task is to segment tumors in brain MRI scans automatically from various populations, such as adults, pediatrics, and underserved sub-Saharan Africa. We employ a recent CNN architecture for medical image segmentation, namely MedNeXt, as our baseline, and we implement extensive model ensembling and postprocessing for inference. Our experiments show that our method performs well on the unseen validation set with an average DSC of 85.54% and HD95 of 27.88. The code is available on this https URL.
https://arxiv.org/abs/2405.02852
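Model ensembling for segmentation inference typically reduces to averaging the class probabilities of several trained networks. A minimal sketch (softmax averaging is an assumption; the challenge entry may weight models or postprocess differently):

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax outputs of several segmentation models."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=1) for m in models])
    return probs.mean(dim=0)   # averaged (B, classes, ...) probabilities
```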
We attempt to use a convolutional neural network to perform kinematic analysis of plane bar structures. Using the 3ds Max animation software and the OpenCV module, we build an image dataset of geometrically stable and geometrically unstable systems, and we construct and train a convolutional neural network model based on the TensorFlow and Keras deep learning frameworks. The model achieves 100% accuracy on the training, validation, and test sets, and 93.7% accuracy on an additional test set, indicating that a convolutional neural network can learn and master the relevant knowledge of kinematic analysis in structural mechanics. In the future, the generalization ability of the model can be improved through greater dataset diversity, giving it the potential to surpass human experts on complex structures. Convolutional neural networks thus have practical value in the field of kinematic analysis of structural mechanics. Using visualization techniques, we reveal how the convolutional neural network learns and recognizes structural features. Using a pre-trained VGG16 model for feature extraction and fine-tuning, we found that its generalization ability is inferior to that of the self-built model.
https://arxiv.org/abs/2405.02807
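A minimal Keras sketch of a binary stable/unstable image classifier of the kind described above, matching the TensorFlow/Keras setup the authors name; the layer sizes and input resolution are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 1)),             # grayscale structure image
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # stable vs. unstable
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```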
Artificial neural networks trained on large, expert-labelled datasets are considered state-of-the-art for a range of medical image recognition tasks. However, categorically labelled datasets are time-consuming to generate and constrain classification to a pre-defined, fixed set of classes. For neuroradiological applications in particular, this represents a barrier to clinical adoption. To address these challenges, we present a self-supervised text-vision framework that learns to detect clinically relevant abnormalities in brain MRI scans by directly leveraging the rich information contained in accompanying free-text neuroradiology reports. Our training approach consisted of two steps. First, a dedicated neuroradiological language model, NeuroBERT, was trained to generate fixed-dimensional vector representations of neuroradiology reports (N = 50,523) via domain-specific self-supervised learning tasks. Next, convolutional neural networks (one per MRI sequence) learnt to map individual brain scans to their corresponding text vector representations by optimising a mean square error loss. Once trained, our text-vision framework can be used to detect abnormalities in unreported brain MRI examinations by scoring scans against suitable query sentences (e.g., 'there is an acute stroke', 'there is hydrocephalus', etc.), enabling a range of classification-based applications including automated triage. Potentially, our framework could also serve as a clinical decision support tool, not only by suggesting findings to radiologists and detecting errors in provisional reports, but also by retrieving and displaying examples of pathologies from historical examinations that could be relevant to the current case based on textual descriptors.
https://arxiv.org/abs/2405.02782
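Once scans are mapped into the report-embedding space, abnormality detection reduces to similarity scoring against embedded query sentences. A minimal sketch; cosine similarity and the 768-dimensional size are assumptions, with the scan vector coming from the per-sequence CNN and the query vectors from the report language model.

```python
import torch
import torch.nn.functional as F

def abnormality_scores(scan_vec, query_vecs):
    """Cosine similarity between a scan's predicted report-space vector
    and embeddings of queries such as 'there is an acute stroke'."""
    scan = F.normalize(scan_vec, dim=-1)        # (d,)
    queries = F.normalize(query_vecs, dim=-1)   # (n_queries, d)
    return queries @ scan                       # (n_queries,) scores

scores = abnormality_scores(torch.randn(768), torch.randn(3, 768))
```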
The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency which are crucial aspects in global scale applications.
https://arxiv.org/abs/2405.02771
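At the core of any MAE-style pretraining is random masking of patch tokens before reconstruction. A minimal sketch of the masking step (the 60% ratio is an assumption; the multi-pretext setup above adds modality-specific reconstruction targets on top):

```python
import torch

def random_patch_mask(n_patches, mask_ratio=0.6):
    """Boolean mask over patch tokens: True = masked, to be reconstructed."""
    n_mask = int(n_patches * mask_ratio)
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[torch.randperm(n_patches)[:n_mask]] = True
    return mask
```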
Model compression and hardware acceleration are essential for the resource-efficient deployment of deep neural networks. Modern object detectors have highly interconnected convolutional layers with concatenations. In this work, we study how pruning can be applied to such architectures, using YOLOv7 as an example. We propose a method to handle concatenation layers, based on the connectivity graph of convolutional layers. By automating iterative sensitivity analysis, pruning, and subsequent model fine-tuning, we can significantly reduce model size in terms of both the number of parameters and FLOPs, while keeping comparable model accuracy. Finally, we deploy pruned models to FPGA and NVIDIA Jetson Xavier AGX. Pruned models demonstrate a 2x speedup for the convolutional layers in comparison to the unpruned counterparts and reach real-time capability with 14 FPS on FPGA. Our code is available at this https URL.
https://arxiv.org/abs/2405.03715
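Concatenation layers are the crux: kept-channel indices chosen per input branch must be offset into the concatenated channel space so the consuming convolution's input channels can be sliced consistently. A minimal sketch of that bookkeeping (names are illustrative):

```python
def concat_keep_indices(branch_keep, branch_widths):
    """branch_keep: kept-channel index lists, one per concat input;
    branch_widths: original channel counts of those inputs."""
    kept, offset = [], 0
    for keep, width in zip(branch_keep, branch_widths):
        kept.extend(offset + i for i in keep)
        offset += width
    return kept

# branches of 4 and 3 channels, keeping [0, 2] and [1]:
print(concat_keep_indices([[0, 2], [1]], [4, 3]))   # -> [0, 2, 5]
```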
This paper aims to create a deep learning framework that can estimate the deformation vector field (DVF) for directly registering abdominal MRI-CT images. The proposed method assumes a diffeomorphic deformation. By using topology-preserving deformation features extracted from a probabilistic diffeomorphic registration model, abdominal motion can be accurately obtained and utilized for DVF estimation. The model integrates Swin transformers, which have demonstrated superior performance in motion tracking, into a convolutional neural network (CNN) for deformation feature extraction. The model was optimized using a cross-modality image similarity loss and a surface matching loss. To compute the image loss, a modality-independent neighborhood descriptor (MIND) was used between the deformed MRI and CT images. The surface matching loss was determined by measuring the distance between the warped coordinates of the surfaces of contoured structures on the MRI and CT images. The deformed MRI image was assessed against the CT image using the target registration error (TRE), Dice similarity coefficient (DSC), and mean surface distance (MSD) between the deformed contours of the MRI image and manual contours of the CT image. Compared to rigid registration alone, deformable image registration (DIR) with the proposed method increased the mean DSC of the liver and portal vein from 0.850 and 0.628 to 0.903 and 0.763, decreased the mean MSD of the liver from 7.216 mm to 3.232 mm, and decreased the TRE from 26.238 mm to 8.492 mm. The proposed deformable image registration method based on a diffeomorphic transformer provides an effective and efficient way to generate an accurate DVF from an abdominal MRI-CT image pair. It could be utilized in the current treatment planning workflow for liver radiotherapy.
https://arxiv.org/abs/2405.02692
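The surface matching loss measures distances between warped MRI surface points and the corresponding CT surface points. A Chamfer-style stand-in sketch (the paper's exact distance definition may differ):

```python
import torch

def surface_matching_loss(warped_pts, target_pts):
    """Mean bidirectional nearest-neighbour distance between point sets
    warped_pts: (N, 3) and target_pts: (M, 3)."""
    d = torch.cdist(warped_pts, target_pts)   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```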
In this paper, we propose a novel model for a malware classification system based on Application Programming Interface (API) calls and opcodes, to improve classification accuracy. This system uses a novel design combining a Convolutional Neural Network and Long Short-Term Memory. We extract opcode sequences and API calls from Windows malware samples for classification. We transform these features into N-gram sequences (N = 2, 3, ..., 10). Our experiments on a dataset of 9,749,57 samples produce a high accuracy of 99.91% using the 8-gram sequences. Our method significantly improves malware classification performance when using a wide range of recent deep learning architectures, leading to state-of-the-art performance. In particular, we experiment with ConvNeXt-T, ConvNeXt-S, RegNetY-4GF, RegNetY-8GF, RegNetY-12GF, EfficientNetV2, Sequencer2D-L, Swin-T, ViT-G/14, ViT-Ti, ViT-S, ViT-B, ViT-L, and MaxViT-B. Among these architectures, Swin-T and Sequencer2D-L achieved high accuracies of 99.82% and 99.70%, respectively, comparable to our CNN-LSTM architecture although not surpassing it.
https://arxiv.org/abs/2405.02548
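A minimal sketch of a combined CNN-LSTM classifier over embedded opcode/API token sequences, in the spirit of the design above; all layer sizes are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """1D conv over embedded n-gram tokens, then an LSTM and a linear head."""
    def __init__(self, vocab, n_classes, emb=128, conv=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Sequential(nn.Conv1d(emb, conv, 5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(conv, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, tokens):                   # tokens: (B, T) int64
        x = self.embed(tokens).transpose(1, 2)   # (B, emb, T)
        x = self.conv(x).transpose(1, 2)         # (B, T, conv)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])                    # logits per malware family
```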
Although space weather events may not directly affect human life, they have the potential to inflict significant harm upon our communities. Harmful space weather events can trigger atmospheric changes that result in physical and economic damage on a global scale. In 1989, Earth experienced the effects of a powerful geomagnetic storm that caused satellites to malfunction, while triggering power blackouts in Canada along with electricity disturbances in the United States and Europe. With the solar cycle peak rapidly approaching, there is an ever-increasing need to prepare for and prevent the damage that can occur, especially to modern-day technology, calling for a comprehensive prediction system. This study aims to leverage machine learning techniques to predict instances of space weather (solar flares, coronal mass ejections, geomagnetic storms) based on active-region magnetograms of the Sun. We used the NASA DONKI service to determine when these solar events occurred, then used data from the NASA Solar Dynamics Observatory to compile a dataset of magnetograms of active regions of the Sun 24 hours before the events. Feeding these magnetograms into a convolutional neural network (CNN) trained on this dataset allows it to predict whether a space weather event will occur and what type of event it will be. The model uses a custom CNN architecture and achieved an accuracy of 90.27%, a precision of 85.83%, a recall of 91.78%, and an average F1 score of 92.14% across the classes (solar flare [Flare], geomagnetic storm [GMS], coronal mass ejection [CME]). Our results show that using magnetogram data as input to a CNN is a viable method for space weather prediction. Future work can involve predicting the magnitude of solar events.
https://arxiv.org/abs/2405.02545