Abstract
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly treat text as the primary modality, which makes them ill-suited to documents dominated by image content. In visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in a drawing. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model consists of three main parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED in PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The results indicate that, on our engineering drawing dataset, the approach attains 96% accuracy on the relation prediction task, a substantial improvement over existing methods. The results also show that ViRED performs inference quickly even when a single engineering drawing contains numerous objects.
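The abstract names three components (vision encoder, object encoder, relation decoder) and a pairwise relation-prediction objective, but gives no architectural details. Below is a minimal PyTorch sketch of how such a pipeline could be wired together; the layer sizes, the box-plus-image-context object encoding, and the pairwise scoring head (`pair_head`) are all illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ViREDSketch(nn.Module):
    """Minimal sketch of a three-part ViRED-style model.
    All module choices here are assumptions for illustration."""

    def __init__(self, d_model=256, num_heads=8, num_layers=4):
        super().__init__()
        # Vision encoder: pools the whole drawing into one global feature.
        # (A real system would likely use a ViT/CNN backbone.)
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Object encoder: embeds each object's bounding box (x1, y1, x2, y2)
        # together with the global image feature.
        self.object_encoder = nn.Sequential(
            nn.Linear(4 + d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Relation decoder: lets object tokens attend to each other, then a
        # hypothetical pairwise head scores every ordered object pair.
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.relation_decoder = nn.TransformerEncoder(layer, num_layers)
        self.pair_head = nn.Linear(2 * d_model, 1)

    def forward(self, image, boxes):
        # image: (B, 3, H, W); boxes: (B, N, 4) with normalized coordinates.
        B, N, _ = boxes.shape
        img_feat = self.vision_encoder(image)                 # (B, d_model)
        ctx = img_feat.unsqueeze(1).expand(B, N, -1)          # one copy per object
        obj = self.object_encoder(torch.cat([boxes, ctx], dim=-1))
        obj = self.relation_decoder(obj)                      # contextualized tokens
        # Score all N x N ordered pairs in one shot, which is what allows
        # fast inference even with many objects per drawing.
        src = obj.unsqueeze(2).expand(B, N, N, -1)
        dst = obj.unsqueeze(1).expand(B, N, N, -1)
        return self.pair_head(torch.cat([src, dst], dim=-1)).squeeze(-1)

# Usage: relation logits among 5 objects in one 512x512 drawing.
model = ViREDSketch()
scores = model(torch.randn(1, 3, 512, 512), torch.rand(1, 5, 4))
print(scores.shape)  # torch.Size([1, 5, 5])
```

Scoring all object pairs jointly, rather than one pair per forward pass, is one plausible way to read the abstract's claim of fast inference on drawings with numerous objects.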
URL
https://arxiv.org/abs/2409.00909