Abstract
Railway Turnout Machines (RTMs) are mission-critical components of railway transportation infrastructure, responsible for directing trains onto the desired tracks. For safety assurance, especially in early-warning scenarios, RTM faults are expected to be detected as early as possible, around the clock (24/7). However, little attention has been paid to distributed model inference frameworks that can meet the inference latency and reliability requirements of such mission-critical fault diagnosis systems. In this paper, an edge-cloud collaborative early-warning system is proposed to enable real-time and downtime-tolerant fault diagnosis of RTMs, providing a new paradigm for deploying models in safety-critical scenarios. First, a modular fault diagnosis model is designed specifically for distributed deployment; it uses a hierarchical architecture consisting of a prior knowledge module, subordinate classifiers, and a fusion layer to improve accuracy and parallelism. Then, a cloud-edge collaborative framework leveraging pipeline parallelism, namely CEC-PA, is developed; it strategically partitions and offloads model components across cloud and edge to minimize the overhead of distributed task execution and context exchange. Additionally, an election consensus mechanism is implemented within CEC-PA to keep the system robust during coordinator node downtime. Comparative experiments and ablation studies are conducted to validate the effectiveness of the proposed distributed fault diagnosis approach. The ensemble-based fault diagnosis model achieves 97.4% accuracy on a real-world dataset collected by Nanjing Metro in Jiangsu Province, China. Meanwhile, CEC-PA recovers more effectively from node disruptions than its counterparts and achieves speed-ups ranging from 1.98x to 7.93x in total inference time.
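To make the coordinator-failover idea concrete, the following is a minimal sketch of one possible election rule. It is not the mechanism used in CEC-PA, which the abstract does not specify. It assumes a bully-style policy in which the live node with the highest identifier becomes coordinator; the names `Node`, `Cluster`, `elect`, `fail`, and `recover` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Node:
    node_id: int        # hypothetical unique id; the highest live id wins
    alive: bool = True


class Cluster:
    """Toy cluster that keeps one live coordinator whenever any node is up."""

    def __init__(self, node_ids: Iterable[int]) -> None:
        self.nodes: Dict[int, Node] = {i: Node(i) for i in node_ids}
        self.coordinator: Optional[int] = None
        self.elect()

    def elect(self) -> Optional[int]:
        # Assumed bully-style rule: pick the live node with the highest id.
        live = [n.node_id for n in self.nodes.values() if n.alive]
        self.coordinator = max(live) if live else None
        return self.coordinator

    def fail(self, node_id: int) -> None:
        # Mark a node as down; re-elect only if the coordinator was lost.
        self.nodes[node_id].alive = False
        if node_id == self.coordinator:
            self.elect()

    def recover(self, node_id: int) -> None:
        # A recovered node rejoins and triggers a fresh election.
        self.nodes[node_id].alive = True
        self.elect()


cluster = Cluster([1, 2, 3])
cluster.fail(3)  # the coordinator goes down and a new one is elected
```

In a real deployment, failure detection (for example heartbeats with timeouts) and agreement among nodes over an unreliable network would replace the direct method calls used here.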
URL
https://arxiv.org/abs/2411.02086