Paper Reading AI Learner

Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling

2023-03-13 09:34:08
Yongshuai Huang, Ning Lu, Dapeng Chen, Yibo Li, Zecheng Xie, Shenggao Zhu, Liangcai Gao, Wei Peng

Abstract

Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches predict the two structures simultaneously with two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, previous methods struggle with imprecise bounding boxes because the logical representation lacks local visual information. To address this issue, we propose an end-to-end sequential modeling framework for table structure recognition called VAST. It contains a novel coordinate sequence decoder triggered by the representation of the non-empty cell from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, where the left, top, right and bottom coordinates are decoded sequentially to leverage the inter-coordinate dependency. Furthermore, we propose an auxiliary visual-alignment loss that forces the logical representation of the non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. Extensive experiments demonstrate that our proposed method achieves state-of-the-art results in both logical and physical structure recognition. The ablation study also validates that the proposed coordinate sequence decoder and the visual-alignment loss are key to the success of our method.
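
The idea of treating bounding box coordinates as a language sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary size `NUM_BINS`, the helper names `box_to_tokens`, `tokens_to_box` and `decode_box`, and the `next_token` callback standing in for a trained decoder are all assumptions made for illustration. The sketch only shows how a box might be quantized into four discrete tokens and how those tokens could be decoded one at a time in left, top, right, bottom order, with each step seeing the coordinates already produced.

```python
from typing import Callable, List, Sequence, Tuple

NUM_BINS = 1000  # assumed coordinate vocabulary size; the abstract does not specify one


def box_to_tokens(box: Sequence[float], width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[int]:
    """Quantize a pixel box (left, top, right, bottom) into four integer tokens."""
    left, top, right, bottom = box

    def quantize(value: float, extent: int) -> int:
        idx = int(value * num_bins / extent)
        # Clamp so values on the image border stay inside the vocabulary.
        return min(max(idx, 0), num_bins - 1)

    return [quantize(left, width), quantize(top, height),
            quantize(right, width), quantize(bottom, height)]


def tokens_to_box(tokens: Sequence[int], width: int, height: int,
                  num_bins: int = NUM_BINS) -> Tuple[float, float, float, float]:
    """Map four tokens back to pixel coordinates, using the center of each bin."""
    left, top, right, bottom = tokens

    def dequantize(token: int, extent: int) -> float:
        return (token + 0.5) * extent / num_bins

    return (dequantize(left, width), dequantize(top, height),
            dequantize(right, width), dequantize(bottom, height))


def decode_box(cell_repr: object,
               next_token: Callable[[object, List[int]], int]) -> List[int]:
    """Greedy sequential decoding of left, top, right, bottom.

    `next_token` is a hypothetical stand-in for a trained decoder step: it
    receives the cell representation and the tokens decoded so far, so each
    coordinate can depend on the ones before it.
    """
    tokens: List[int] = []
    for _ in range(4):
        tokens.append(next_token(cell_repr, tokens))
    return tokens
```

With this quantization, a decoded box differs from the original by at most one bin width per coordinate, so the choice of `NUM_BINS` trades vocabulary size against localization precision.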

URL

https://arxiv.org/abs/2303.06949

PDF

https://arxiv.org/pdf/2303.06949.pdf

