Paper Reading AI Learner

GenKIE: Robust Generative Multimodal Document Key Information Extraction

2023-10-24 19:12:56
Panfeng Cao, Ye Wang, Qiang Zhang, Zaiqiao Meng

Abstract

Key information extraction (KIE) from scanned documents has gained increasing attention because of its applications in various domains. Although promising results have been achieved by some recent KIE approaches, they are usually built on discriminative models, which lack the ability to handle optical character recognition (OCR) errors and require laborious token-level labelling. In this paper, we propose a novel generative end-to-end model, named GenKIE, to address the KIE task. GenKIE is a sequence-to-sequence multimodal generative model that uses multimodal encoders to embed visual, layout and textual features and a decoder to generate the desired output. Well-designed prompts are leveraged to incorporate label semantics as weakly supervised signals and to elicit the generation of the key information. One notable advantage of the generative model is that it enables automatic correction of OCR errors. Moreover, token-level granular annotation is not required. Extensive experiments on multiple public real-world datasets show that GenKIE effectively generalizes over different types of documents and achieves state-of-the-art results. Our experiments also validate the model's robustness against OCR errors, making GenKIE highly applicable in real-world scenarios.
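The prompt-based generation scheme described above can be illustrated with a minimal sketch. All function names and the prompt template below are assumptions for illustration, not GenKIE's actual implementation; the decoder output is simulated rather than produced by a trained model.

```python
# Hypothetical sketch of prompt-based generative KIE. The input is the OCR'd
# document text concatenated with a prompt per entity type (the prompt carries
# the label semantics); the decoder is expected to generate "type is value"
# pairs, which are then parsed back into structured key information.

def build_prompt(ocr_tokens, entity_types):
    """Join OCR tokens into document text and append one prompt per
    entity type, e.g. 'company is ?' (template is an assumption)."""
    document = " ".join(ocr_tokens)
    prompts = " ".join(f"{etype} is ?" for etype in entity_types)
    return f"{document} {prompts}"

def parse_generation(generated, entity_types):
    """Parse a decoder output like 'company is ACME Corp; date is 2023-10-24'
    into a dict mapping entity type -> extracted value."""
    results = {}
    for segment in generated.split(";"):
        segment = segment.strip()
        for etype in entity_types:
            prefix = f"{etype} is "
            if segment.startswith(prefix):
                results[etype] = segment[len(prefix):]
    return results

# Note the OCR error 'ACM3' in the tokens: because the output is generated
# token by token rather than copied from the input, the decoder can emit the
# corrected surface form 'ACME', which a discriminative tagger cannot.
tokens = ["ACM3", "Corp", "Invoice", "2023-10-24"]
prompt = build_prompt(tokens, ["company", "date"])
generated = "company is ACME Corp; date is 2023-10-24"  # simulated decoder output
print(parse_generation(generated, ["company", "date"]))
```

This also shows why no token-level labels are needed: supervision is the target string itself, not a per-token tag sequence.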

URL

https://arxiv.org/abs/2310.16131

PDF

https://arxiv.org/pdf/2310.16131.pdf
