Current state-of-the-art two-stage models for the instance segmentation task suffer from several types of imbalances. In this paper, we address the Intersection over Union (IoU) distribution imbalance of positive input Regions of Interest (RoIs) during the training of the second stage. Our Self-Balanced R-CNN (SBR-CNN), an evolved version of the Hybrid Task Cascade (HTC) model, introduces new loop mechanisms for bounding box and mask refinement. With an improved Generic RoI Extraction (GRoIE), we also address the feature-level imbalance at the Feature Pyramid Network (FPN) level, which originates from a non-uniform integration of low- and high-level features from the backbone layers. In addition, redesigning the architecture heads toward a fully convolutional approach with FCC further reduces the number of parameters and yields clearer insight into the connection between the task to solve and the layers used. Moreover, our SBR-CNN model shows the same or even better improvements when adopted in conjunction with other state-of-the-art models. In fact, with a lightweight ResNet-50 as the backbone, evaluated on the COCO minival 2017 dataset, our model reaches 45.3% and 41.5% AP for object detection and instance segmentation, respectively, with 12 epochs and without extra tricks. The code is available at this https URL
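The IoU metric on which the addressed distribution imbalance is defined is computed per RoI against its matched ground-truth box; a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the intersection rectangle.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

During second-stage training, positive RoIs are binned by this value against their ground truth; the paper's contribution concerns the skew of that distribution, not the metric itself.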
https://arxiv.org/abs/2404.16633
In below-freezing winter conditions, road surface friction can vary greatly depending on the mixture of snow, ice, and water on the road. Friction between the road and vehicle tyres is a critical parameter governing vehicle dynamics, so road surface friction information is essential for several intelligent transportation applications, such as safe control of automated vehicles or alerting drivers to slippery road conditions. This paper explores computer vision-based evaluation of road surface friction from roadside cameras. Previous studies have extensively investigated the application of convolutional neural networks to the task of evaluating the road surface condition from images. Here, we propose a hybrid deep learning architecture, WCamNet, consisting of a pretrained vision transformer model and convolutional blocks. The motivation of the architecture is to combine the general visual features provided by the transformer model with the finetuned feature extraction properties of the convolutional blocks. To benchmark the approach, an extensive dataset was gathered from the Finnish national road infrastructure's network of roadside cameras and optical road surface friction sensors. The acquired results highlight that the proposed WCamNet outperforms previous approaches in the task of predicting road surface friction from roadside camera images.
https://arxiv.org/abs/2404.16578
In recent years, with the rapid development of computer information technology, the development of artificial intelligence has accelerated. Traditional geometry recognition techniques lag behind and achieve low recognition rates. Faced with massive information databases, traditional algorithm models inevitably suffer from low recognition accuracy and poor performance. Deep learning theory has gradually become a very important part of machine learning, and the implementation of convolutional neural networks (CNNs) reduces the difficulty of graphics generation algorithms. In this paper, by exploiting the LeNet-5 architecture's advantages of weight sharing and combined feature extraction and classification, the proposed geometric pattern recognition model trains faster on the training dataset. By constructing shared feature parameters for the model and using the cross-entropy loss function during recognition, we improve the model's generalization and its average recognition accuracy on the test dataset.
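The cross-entropy loss used during recognition can be sketched for a single sample as follows; this is the generic formulation, not the paper's training code:

```python
import math

def softmax(logits):
    """Numerically stable softmax turning raw class scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy loss for one sample: minus the log-probability
    assigned to the true class (probs assumed to sum to 1)."""
    return -math.log(probs[target])
```

For a two-class case with equal logits, the predicted probability is 0.5 and the loss is log 2, the maximum-uncertainty baseline.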
https://arxiv.org/abs/2404.16561
Scour around bridge piers is a critical challenge for infrastructure around the world. In the absence of analytical models, and due to the complexity of the scour process, it is difficult for current empirical methods to achieve accurate predictions. In this paper, we exploit the power of deep learning algorithms to forecast the scour depth variations around bridge piers based on historical sensor monitoring data, including riverbed elevation, flow elevation, and flow velocity. We investigated the performance of Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models for real-time scour forecasting using data collected from bridges in Alaska and Oregon from 2006 to 2021. The LSTM models achieved mean absolute errors (MAE) ranging from 0.1 m to 0.5 m when predicting bed level variations a week in advance, showing reasonable performance. The Fully Convolutional Network (FCN) variant of CNN outperformed the other CNN configurations, showing performance comparable to the LSTMs at significantly lower computational cost. We explored various innovative random-search heuristics for hyperparameter tuning and model optimisation, which reduced computational cost compared to the grid-search method. Analysing the impact of different combinations of sensor features on scour prediction showed the significance of the historical scour time series for predicting upcoming events. Overall, this study provides a greater understanding of the potential of Deep Learning (DL) for real-time scour forecasting and early warning in bridges with diverse scour and flow characteristics, including riverine and tidal/coastal bridges.
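The paper does not detail its random-search heuristics; a minimal generic sketch of random search over a discrete hyperparameter space (the `train_eval` callback and the space below are illustrative, not the authors') might look like:

```python
import random

def random_search(train_eval, space, n_trials=20, seed=0):
    """Randomly sample hyperparameter configurations from `space`
    (dict of name -> candidate values) and return the best
    (config, score) pair, where lower score is better."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = train_eval(cfg)  # e.g. validation MAE in metres
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Unlike grid search, the cost is fixed by `n_trials` rather than growing with the product of the candidate-list lengths, which is where the reported savings come from.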
https://arxiv.org/abs/2404.16549
Deep convolutional neural networks (DCNNs) are a class of artificial neural networks used primarily for computer vision tasks such as segmentation and classification. Many nonlinear operations, such as activation functions and pooling strategies, are used in DCNNs to enhance their ability to process different signals in different tasks. Conventional convolution, a linear filter, is the essential component of DCNNs, while nonlinear convolution is generally implemented with higher-order Volterra filters. However, the significant memory and computational costs of Volterra filtering are the primary limitation to its widespread use in DCNNs. In this study, we propose a novel method to perform higher-order Volterra filtering with lower memory and computation cost in the forward and backward passes of DCNN training. The proposed method demonstrates computational advantages compared with conventional Volterra filter implementations. Furthermore, based on the proposed method, a new attention module called the Higher-order Local Attention Block (HLA) is proposed and tested on the CIFAR-100 dataset, showing a competitive improvement for the classification task. Source code is available at: this https URL
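A naive second-order Volterra filter illustrates why memory and computation grow quickly with kernel length; this is the generic 1-D textbook form, not the paper's optimized method:

```python
def volterra2(x, h1, h2):
    """Second-order 1-D Volterra filter:
    y[n] = sum_i h1[i] x[n-i] + sum_i sum_j h2[i][j] x[n-i] x[n-j].
    The second-order kernel h2 has k*k coefficients for memory length k,
    so cost and storage grow quadratically (and worse for higher orders),
    which is the bottleneck the paper targets."""
    k = len(h1)
    y = []
    for n in range(len(x)):
        # Delayed samples x[n], x[n-1], ..., zero-padded before the start.
        d = [x[n - i] if n - i >= 0 else 0.0 for i in range(k)]
        lin = sum(h1[i] * d[i] for i in range(k))
        quad = sum(h2[i][j] * d[i] * d[j] for i in range(k) for j in range(k))
        y.append(lin + quad)
    return y
```

With all of h2 zero this reduces to ordinary linear convolution, which is the sense in which Volterra filtering generalizes the conventional convolution layer.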
https://arxiv.org/abs/2404.16380
Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-image datasets. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between pairs of classes. We propose a framework that extends the independent classifiers by incorporating co-occurrence information for object pairs to improve their performance. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes by refining the initial estimates derived from the image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.
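The pairwise conditional probabilities described above can be estimated directly from the multi-label training annotations; a minimal sketch (function and variable names are illustrative):

```python
def cooccurrence_matrix(label_sets, classes):
    """Estimate P(class j present | class i present) from multi-label
    training annotations, where each sample is a set of class names.
    Returns a row-indexed nested list: row i holds P(j | i)."""
    idx = {c: k for k, c in enumerate(classes)}
    n = len(classes)
    count = [[0] * n for _ in range(n)]  # count[i][j]: samples with both i and j
    occur = [0] * n                      # occur[i]: samples containing i
    for labels in label_sets:
        for a in labels:
            occur[idx[a]] += 1
            for b in labels:
                count[idx[a]][idx[b]] += 1
    return [[count[i][j] / occur[i] if occur[i] else 0.0 for j in range(n)]
            for i in range(n)]
```

In the paper's framework such statistics condition a GCN that refines the independent per-class scores; the estimator here only shows where the conditional probabilities come from.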
https://arxiv.org/abs/2404.16193
Anomaly detection in industrial systems is crucial for preventing equipment failures, ensuring risk identification, and maintaining overall system efficiency. Traditional monitoring methods often rely on fixed thresholds and empirical rules, which may not be sensitive enough to detect subtle changes in system health and predict impending failures. To address this limitation, this paper proposes a novel Attention-based convolutional autoencoder (ABCD) for risk detection and maps the derived risk value to maintenance planning. ABCD learns the normal behavior of conductivity from historical data of a real-world industrial cooling system and reconstructs the input data, identifying anomalies that deviate from the expected patterns. The framework also employs calibration techniques to ensure the reliability of its predictions. Evaluation results demonstrate that the attention mechanism in ABCD yields a 57.4% increase in performance and a 9.37% reduction in false alarms compared to the variant without attention. The approach effectively detects risks and maps the risk priority rank to maintenance, providing valuable insights for cooling system designers and service personnel. A calibration error of 0.03% indicates that the model is well calibrated, which enhances its trustworthiness and enables informed decisions about maintenance strategies.
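The reconstruction-error logic behind such an autoencoder can be sketched independently of the network itself; the mean-plus-3-sigma threshold below is an assumption for illustration, not the paper's calibration technique:

```python
import statistics

def fit_threshold(normal_errors, k=3.0):
    """Set an anomaly threshold from reconstruction errors measured on
    normal data: mean + k standard deviations (k = 3 is an assumption)."""
    mu = statistics.mean(normal_errors)
    sigma = statistics.pstdev(normal_errors)
    return mu + k * sigma

def flag_anomalies(errors, threshold):
    """An input is anomalous when the autoencoder fails to reconstruct
    it, i.e. its reconstruction error exceeds the learned threshold."""
    return [e > threshold for e in errors]
```

Since the autoencoder is trained only on normal behavior, deviating inputs reconstruct poorly and land above the threshold.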
https://arxiv.org/abs/2404.16183
In the field of 3D Human Pose Estimation (HPE), accurately estimating human pose, especially in scenarios with occlusions, is a significant challenge. This work identifies and addresses a gap in the current state of the art in 3D HPE concerning the scarcity of data and of strategies for handling occlusions. We introduce our novel BlendMimic3D dataset, designed to mimic real-world situations where occlusions occur, for seamless integration into 3D HPE algorithms. Additionally, we propose a 3D pose refinement block, employing a Graph Convolutional Network (GCN) to enhance pose representation through a graph model. This GCN block acts as a plug-and-play solution, adaptable to various 3D HPE frameworks without requiring them to be retrained. By training the GCN with occluded data from BlendMimic3D, we demonstrate significant improvements in resolving occluded poses, with comparable results for non-occluded ones. The project web page is available at this https URL.
https://arxiv.org/abs/2404.16136
Analyzing volumetric data with rotational invariance or equivariance is an active topic in current research. Existing deep-learning approaches utilize either group convolutional networks limited to discrete rotations or steerable convolutional networks with constrained filter structures. This work proposes a novel equivariant neural network architecture, the EquiLoPO Network, that achieves analytical Equivariance to Local Pattern Orientation on the continuous SO(3) group while allowing unconstrained trainable filters. Our key innovations are a group convolutional operation leveraging irreducible representations as the Fourier basis and a local activation function in the SO(3) space that provides a well-defined mapping from input to output functions, preserving equivariance. By integrating these operations into a ResNet-style architecture, we propose a model that overcomes the limitations of prior methods. A comprehensive evaluation on diverse 3D medical imaging datasets from MedMNIST3D demonstrates the effectiveness of our approach, which consistently outperforms the state of the art. This work suggests the benefits of true rotational equivariance on SO(3) and of the flexible, unconstrained filters enabled by the local activation function, providing a flexible framework for equivariant deep learning on volumetric data with potential applications across domains. Our code is publicly available at \url{this https URL}.
https://arxiv.org/abs/2404.15979
Macular degeneration is a well-known retinal disease that causes blurred vision in affected patients. This research classifies healthy and macular-degeneration fundus images and localizes the affected region of the fundus. A CNN architecture and CNNs with ResNet backbones (ResNet50, ResNet50v2, ResNet101, ResNet101v2, ResNet152, ResNet152v2) are used to classify the two types of fundus. The data are split three ways: (a) 90% training and 10% testing, (b) 80% training and 20% testing, and (c) 50% training and 50% testing. After training, the best model is selected based on the evaluation metrics. Among the models, the CNN with a ResNet50 backbone performs best, giving a training accuracy of 98.7% for the 90%/10% train/test split. With this model, we perform Grad-CAM visualization to obtain the affected region of the fundus.
https://arxiv.org/abs/2404.15918
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. The most recent UDA methods resort to adversarial training to yield state-of-the-art results, and a dominant share of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The Vision Transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has never been investigated. In this paper, we fill this gap by employing the ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that ViT can be a plug-and-play component in adversarial domain adaptation: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based feature extractor can easily yield performance improvements. The code is available at this https URL.
https://arxiv.org/abs/2404.15817
The integration of deep learning based systems in clinical practice is often impeded by challenges rooted in limited and heterogeneous medical datasets. In addition, the prioritization of marginal performance improvements on a few, narrowly scoped benchmarks over clinical applicability has slowed down meaningful algorithmic progress. This trend often results in excessive fine-tuning of existing methods to achieve state-of-the-art performance on selected datasets rather than fostering clinically relevant innovations. In response, this work presents a comprehensive benchmark for the MedMNIST+ database to diversify the evaluation landscape and conducts a thorough analysis of common convolutional neural networks (CNNs) and Transformer-based architectures for medical image classification. Our evaluation encompasses various medical datasets, training methodologies, and input resolutions, aiming to reassess the strengths and limitations of widely used model variants. Our findings suggest that computationally efficient training schemes and modern foundation models hold promise in bridging the gap between expensive end-to-end training and more resource-refined approaches. Additionally, contrary to prevailing assumptions, we observe that higher resolutions may not consistently improve performance beyond a certain threshold, advocating for the use of lower resolutions, particularly in prototyping stages, to expedite processing. Notably, our analysis reaffirms the competitiveness of convolutional models compared to ViT-based architectures, emphasizing the importance of comprehending the intrinsic capabilities of different model architectures. Moreover, we hope that our standardized evaluation framework will help enhance transparency, reproducibility, and comparability on the MedMNIST+ dataset collection, as well as future research within the field. Code will be released soon.
https://arxiv.org/abs/2404.15786
Autonomous vehicles (AVs) heavily rely on LiDAR perception for environment understanding and navigation. LiDAR intensity provides valuable information about the reflected laser signals and plays a crucial role in enhancing the perception capabilities of AVs. However, accurately simulating LiDAR intensity remains a challenge due to the unavailability of material properties of the objects in the environment and the complex interactions between the laser beam and the environment. The proposed method aims to improve the accuracy of intensity simulation by incorporating physics-based modalities within the deep learning framework. One of the key quantities that captures the interaction between the laser beam and the objects is the angle of incidence. In this work we demonstrate that adding the LiDAR incidence angle as a separate input to the deep neural networks significantly enhances the results. We present a comparative study between two prominent deep learning architectures: U-NET, a convolutional neural network (CNN), and Pix2Pix, a generative adversarial network (GAN). We implemented these two architectures for the intensity prediction task and used the SemanticKITTI and VoxelScape datasets for experiments. The comparative analysis reveals that both architectures benefit from the incidence angle as an additional input. Moreover, the Pix2Pix architecture outperforms U-NET, especially when the incidence angle is incorporated.
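The angle of incidence used as the extra input can be computed from the LiDAR ray direction and the surface normal at the hit point; a minimal sketch (taking the absolute cosine, so the result is independent of the normal's sign convention, is an assumption here):

```python
import math

def incidence_angle(ray_dir, surface_normal):
    """Angle (radians) between an incoming LiDAR ray and the surface
    normal at the hit point; 0 means the beam hits the surface head-on."""
    dot = sum(r * n for r, n in zip(ray_dir, surface_normal))
    nr = math.sqrt(sum(r * r for r in ray_dir))
    nn = math.sqrt(sum(n * n for n in surface_normal))
    # Use the absolute cosine so the angle lies in [0, pi/2] regardless
    # of whether the normal points toward or away from the sensor.
    c = min(1.0, abs(dot) / (nr * nn))
    return math.acos(c)
```

A grazing return (angle near pi/2) reflects far less energy than a head-on one, which is why this geometric feature carries so much information for intensity prediction.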
https://arxiv.org/abs/2404.15774
Skeleton-based action recognition has gained considerable traction thanks to its utilization of succinct and robust skeletal representations. Nonetheless, current methodologies often lean towards utilizing a solitary backbone to model the skeleton modality, which can be limited by inherent flaws in the network backbone. To address this and fully leverage the complementary characteristics of various network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the powerful global modeling capabilities of Transformers. In detail, our proposed HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches utilize GCNs and Transformers to model the 2D and 3D skeletal modalities, respectively. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of the 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset and outperforming most existing methods. Our code will be publicly available at: this https URL.
https://arxiv.org/abs/2404.15719
Hazy images degrade visual quality, and dehazing is a crucial prerequisite for subsequent processing tasks. Most current dehazing methods rely on neural networks and face challenges such as high parameter and computational pressure and weak generalization capabilities. This paper introduces PriorNet, a novel, lightweight, and highly applicable dehazing network designed to significantly improve the clarity and visual quality of hazy images while avoiding excessive detail-extraction issues. The core of PriorNet is the original Multi-Dimensional Interactive Attention (MIA) mechanism, which effectively captures a wide range of haze characteristics, substantially reducing the computational load and generalization difficulties associated with complex systems. By utilizing a uniform convolutional kernel size and incorporating skip connections, we have streamlined the feature extraction process. Simplifying the number of layers and the architecture not only enhances dehazing efficiency but also facilitates easier deployment on edge devices. Extensive testing across multiple datasets has demonstrated PriorNet's exceptional performance in dehazing and clarity restoration, maintaining image detail and color fidelity in single-image dehazing tasks. Notably, with a model size of just 18 KB, PriorNet showcases superior dehazing generalization capabilities compared to other methods. Our research makes a significant contribution to advancing image dehazing technology, providing new perspectives and tools for the field and related domains, particularly emphasizing the importance of improving universality and deployability.
https://arxiv.org/abs/2404.15638
The advancement of The Laser Interferometer Gravitational-Wave Observatory (LIGO) has significantly enhanced the feasibility and reliability of gravitational wave detection. However, LIGO's high sensitivity makes it susceptible to transient noises known as glitches, which necessitate effective differentiation from real gravitational wave signals. Traditional approaches predominantly employ fully supervised or semi-supervised algorithms for the task of glitch classification and clustering. In the future task of identifying and classifying glitches across main and auxiliary channels, it is impractical to build a dataset with manually labeled ground-truth. In addition, the patterns of glitches can vary with time, generating new glitches without manual labels. In response to this challenge, we introduce the Cross-Temporal Spectrogram Autoencoder (CTSAE), a pioneering unsupervised method for the dimensionality reduction and clustering of gravitational wave glitches. CTSAE integrates a novel four-branch autoencoder with a hybrid of Convolutional Neural Networks (CNN) and Vision Transformers (ViT). To further extract features across multi-branches, we introduce a novel multi-branch fusion method using the CLS (Class) token. Our model, trained and evaluated on the GravitySpy O3 dataset on the main channel, demonstrates superior performance in clustering tasks when compared to state-of-the-art semi-supervised learning methods. To the best of our knowledge, CTSAE represents the first unsupervised approach tailored specifically for clustering LIGO data, marking a significant step forward in the field of gravitational wave research. The code of this paper is available at this https URL
https://arxiv.org/abs/2404.15552
Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection. However, existing models generally focus on the encoder-side Transformer to extract features, while a well-designed decoder can bring further potential. We propose CFPFormer, a novel decoder block that integrates feature pyramids and transformers. Specifically, by leveraging patch embedding, cross-layer feature concatenation, and a Gaussian attention mechanism, CFPFormer enhances feature extraction capabilities while promoting generalization across diverse tasks. Benefiting from the Transformer structure and U-shaped connections, our model gains the ability to capture long-range dependencies and effectively up-sample feature maps. Our model achieves superior performance in detecting small objects compared to existing methods. We evaluate CFPFormer on medical image segmentation datasets and object detection benchmarks (VOC 2007, VOC 2012, MS-COCO), demonstrating its effectiveness and versatility. On the ACDC Post-2017-MICCAI-Challenge online test set, our model reaches impressive accuracy and performs well compared with the original decoder setting on the Synapse multi-organ segmentation dataset.
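The exact form of CFPFormer's Gaussian attention is not specified here; a plausible 1-D sketch of distance-based Gaussian weighting (all names and the normalisation choice are assumptions) is:

```python
import math

def gaussian_attention(length, centre, sigma):
    """Attention weights over `length` positions that decay with squared
    distance from a centre position under a Gaussian profile,
    normalised to sum to 1. Illustrative sketch only."""
    w = [math.exp(-((i - centre) ** 2) / (2 * sigma ** 2)) for i in range(length)]
    s = sum(w)
    return [x / s for x in w]
```

The effect of such a bias is to concentrate attention on spatially nearby positions while still allowing long-range interactions through the tails of the Gaussian.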
https://arxiv.org/abs/2404.15451
Optical imaging quality can be severely degraded by system- and sample-induced aberrations. Existing adaptive optics systems typically rely on iterative search algorithms to correct aberrations and improve images. This study demonstrates the application of convolutional neural networks to characterise optical aberrations by directly predicting the Zernike coefficients from two to three phase-diverse optical images. We evaluated our network on 600,000 simulated Point Spread Function (PSF) datasets randomly generated within the range of -1 to 1 radians using the first 25 Zernike coefficients. The results show that using only three phase-diverse images captured above, below, and at the focal plane with an amplitude of 1 achieves a low RMSE of 0.10 radians on the simulated PSF dataset. Furthermore, this approach directly predicts Zernike modes for simulated extended 2D samples, while maintaining a comparable RMSE of 0.15 radians. We demonstrate that this approach is effective using only a single prediction step, or it can be iterated a small number of times. This simple and straightforward technique provides a rapid and accurate method for predicting the aberration correction from three or fewer phase-diverse images, paving the way for evaluation on real-world datasets.
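The RMSE figures above, in radians over the predicted coefficient vector, follow the standard definition; for reference:

```python
import math

def rmse(pred, true):
    """Root-mean-square error between predicted and ground-truth
    Zernike coefficient vectors, in the same units (here, radians)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))
```

For the first 25 Zernike coefficients, `pred` and `true` would each be length-25 vectors per sample.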
https://arxiv.org/abs/2404.15231
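The abstract's quality metric is RMSE in radians between predicted and ground-truth Zernike coefficient vectors (0.10 rad on simulated PSFs, 0.15 rad on extended samples). A small sketch of that metric on dummy data; the network is replaced here by a hypothetical noisy prediction, and only the coefficient range (-1 to 1 radians, first 25 modes) comes from the abstract:

```python
import numpy as np

def zernike_rmse(pred, true):
    """RMSE in radians between predicted and ground-truth Zernike
    coefficient vectors, the metric reported in the abstract."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

rng = np.random.default_rng(0)
true = rng.uniform(-1.0, 1.0, size=25)      # first 25 Zernike modes in [-1, 1] rad
pred = true + rng.normal(0.0, 0.1, 25)      # hypothetical network output with 0.1-rad noise
err = zernike_rmse(pred, true)              # on the order of 0.1 rad for this noise level
```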
Local-nonlocal coupling approaches combine the computational efficiency of local models with the accuracy of nonlocal models. However, the coupling process is challenging, requiring expertise to identify the interface between the local and nonlocal regions. This study introduces a machine learning-based approach to automatically detect the regions in which the local and nonlocal models should be used in a coupling approach. This identification process takes the loading functions as input and outputs the selected model at each grid point. Training is based on datasets of loading functions for which reference coupling configurations are computed using accurate coupled solutions, where accuracy is measured as the relative error between the solution of the coupling approach and the solution of the nonlocal model. We study two approaches that differ in their data structure. The first, referred to as the full-domain input data approach, inputs the full load vector and outputs a full label vector; in this case, the classification is carried out globally. The second is a window-based approach, in which the loads are preprocessed and partitioned into windows and the problem is formulated as a node-wise classification in which the central point of each window is treated individually. The classification problems are solved via deep learning algorithms based on convolutional neural networks. The performance of these approaches is studied on one-dimensional numerical examples using F1-score and accuracy metrics. In particular, the windowing approach yields promising results, achieving an accuracy of 0.96 and an F1-score of 0.97. These results underscore the potential of the approach to automate coupling processes, leading to more accurate and computationally efficient solutions for material science applications.
https://arxiv.org/abs/2404.15388
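The window-based approach partitions the 1D load vector so that each node is classified (local vs. nonlocal) from its neighborhood, with the window's central point treated individually. A minimal NumPy sketch of that preprocessing step; the window half-width and the zero-padding at the domain edges are assumptions for illustration, not choices taken from the paper:

```python
import numpy as np

def make_windows(load, half_width):
    """Turn a 1D load vector of length n into an (n, 2*half_width+1)
    array of overlapping windows, one per node, so a classifier can
    label each central node from its neighborhood. Edges are
    zero-padded (a hypothetical boundary choice)."""
    padded = np.pad(load, half_width)
    w = 2 * half_width + 1
    return np.stack([padded[i:i + w] for i in range(len(load))])

load = np.array([0., 0., 1., 4., 1., 0., 0.])   # toy loading function samples
windows = make_windows(load, half_width=2)      # one window per grid node
```

Each row `windows[i]` would then be fed to the CNN to predict the model label for node `i`.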
Graph neural networks are becoming increasingly popular in the field of machine learning due to their unique ability to process data structured in graphs. They have also been applied in safety-critical environments where perturbations inherently occur. However, these perturbations require us to formally verify neural networks before their deployment in safety-critical environments as neural networks are prone to adversarial attacks. While there exists research on the formal verification of neural networks, there is no work verifying the robustness of generic graph convolutional network architectures with uncertainty in the node features and in the graph structure over multiple message-passing steps. This work addresses this research gap by explicitly preserving the non-convex dependencies of all elements in the underlying computations through reachability analysis with (matrix) polynomial zonotopes. We demonstrate our approach on three popular benchmark datasets.
https://arxiv.org/abs/2404.15065
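The verification approach above propagates uncertain node features through the network with (matrix) polynomial zonotopes. As a much-simplified illustration, a plain (non-polynomial) zonotope, the set {c + G a : ||a||_inf <= 1}, can be pushed through an affine layer exactly; the numbers below are purely illustrative, and polynomial zonotopes extend this arithmetic with higher-order generator terms to keep non-convex dependencies:

```python
import numpy as np

def affine_zonotope(c, G, W, b):
    """Exact image of the zonotope {c + G a : ||a||_inf <= 1}
    under the affine map x -> W x + b."""
    return W @ c + b, W @ G

def interval_bounds(c, G):
    """Tight elementwise lower/upper bounds of a zonotope."""
    r = np.abs(G).sum(axis=1)
    return c - r, c + r

c = np.array([1.0, 0.0])            # center of the uncertain node features
G = np.array([[0.1, 0.0],           # generators: perturbation directions
              [0.0, 0.2]])
W = np.array([[1.0, -1.0],          # toy layer weights
              [0.5, 0.5]])
b = np.zeros(2)

c2, G2 = affine_zonotope(c, G, W, b)
lo, hi = interval_bounds(c2, G2)    # certified output bounds for this layer
```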