
ViRED: Prediction of Visual Relations in Engineering Drawings

2024-09-02 02:42:34
Chao Gu, Ke Lin, Yiyang Luo, Jiahui Hou, Xiang-Yang Li

Abstract

To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly treat text as the main modality, which makes them unsuitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in a drawing. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model consists of three main parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED in PyTorch and validate its efficacy through a series of experiments. The experimental results indicate that, on the engineering drawing dataset, our approach attains an accuracy of 96% on the relation prediction task, a substantial improvement over existing methods. The results also show that ViRED maintains fast inference even when a single engineering drawing contains numerous objects.
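
The abstract names the three components but gives no implementation details. A minimal PyTorch sketch of such a pipeline, in which the ResNet backbone, the embedding dimension, the bounding-box object features, and the pairwise scoring head are all assumptions made for illustration rather than details from the paper, might look like this:

```python
# Minimal sketch of a three-stage relation-prediction pipeline.
# All module choices, dimensions, and the pairwise scoring scheme are
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn
import torchvision


class ViREDSketch(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Vision encoder: a ResNet backbone mapping the whole drawing to a
        # global feature vector (backbone choice is assumed).
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, d_model)
        self.vision_encoder = backbone
        # Object encoder: embeds each object's bounding box (x1, y1, x2, y2)
        # together with the global image feature.
        self.object_encoder = nn.Sequential(
            nn.Linear(4 + d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Relation decoder: scores every ordered object pair from the
        # concatenation of the two object embeddings.
        self.relation_decoder = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, image: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # image: (3, H, W) drawing; boxes: (N, 4) normalized bounding boxes.
        img_feat = self.vision_encoder(image.unsqueeze(0)).squeeze(0)  # (d,)
        ctx = img_feat.expand(boxes.size(0), -1)                       # (N, d)
        obj = self.object_encoder(torch.cat([boxes, ctx], dim=-1))     # (N, d)
        n = obj.size(0)
        src = obj.unsqueeze(1).expand(n, n, -1)  # embedding of pair source
        dst = obj.unsqueeze(0).expand(n, n, -1)  # embedding of pair target
        # (N, N) logits: entry (i, j) scores a relation from object i to j.
        return self.relation_decoder(torch.cat([src, dst], dim=-1)).squeeze(-1)


model = ViREDSketch().eval()
with torch.no_grad():
    logits = model(torch.rand(3, 512, 512), torch.rand(5, 4))  # (5, 5) scores
```

Scoring all N x N object pairs in one batched pass, as in this sketch, is one plausible way such a model could keep inference fast when a drawing contains many objects.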

URL

https://arxiv.org/abs/2409.00909

PDF

https://arxiv.org/pdf/2409.00909.pdf

