Paper Reading AI Learner

Détection d'objets dans les documents numérisés par réseaux de neurones profonds (Object Detection in Digitized Documents with Deep Neural Networks)

2023-01-27 14:45:45
Mélodie Boillet

Abstract

This thesis studies several document layout analysis tasks, such as detecting text lines, splitting documents into acts, and detecting the writing support. We aim to propose object detection models that account for the difficulties of document processing, including the limited amount of training data available. To this end, we propose two deep neural models that follow two different approaches: a pixel-level detection model and an object-level detection model. The first model has few parameters, is fast at prediction time, and obtains accurate prediction masks from a small amount of training data. We collected and standardized many datasets and used them to train a single line detection model that demonstrates high generalization capabilities on out-of-sample documents. The second model is Transformer-based. Designing it required redefining the task of object detection in document images and studying different approaches. Following this study, we propose an object detection strategy that sequentially predicts the coordinates of the objects' enclosing rectangles through pixel classification. This strategy yields a fast model with only a few parameters. Finally, in an industrial setting, new unannotated data are often available. When adapting a model to such data, the system should be given as few newly annotated samples as possible, so selecting relevant samples for manual annotation is crucial for successful adaptation. For this purpose, we propose confidence estimators, based on different approaches, for object detection. We show that these estimators greatly reduce the amount of annotated data required while optimizing performance.
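The sample-selection idea at the end of the abstract can be illustrated with a minimal sketch. It is not the thesis's estimator: it assumes each page yields a binary probability mask, scores a page by its mean per-pixel confidence, and sends the least confident pages to annotation. The names `mask_confidence` and `select_for_annotation` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Sequence

Mask = Sequence[Sequence[float]]  # rows of per-pixel foreground probabilities


def mask_confidence(prob_mask: Mask) -> float:
    """Mean per-pixel confidence of a binary probability mask.

    Each pixel contributes max(p, 1 - p), so p = 0.5 is the least
    confident value and p = 0 or p = 1 the most confident.
    This scoring rule is an assumption made for illustration.
    """
    values = [max(p, 1.0 - p) for row in prob_mask for p in row]
    if not values:
        raise ValueError("empty mask")
    return sum(values) / len(values)


def select_for_annotation(masks: Dict[str, Mask], budget: int) -> List[str]:
    """Return the ids of the `budget` pages with the lowest confidence."""
    scores = {page_id: mask_confidence(mask) for page_id, mask in masks.items()}
    # Sort page ids by ascending confidence and keep the first `budget`.
    return sorted(scores, key=scores.get)[:budget]


if __name__ == "__main__":
    predictions = {
        "page_a": [[0.9, 0.1]],
        "page_b": [[0.5, 0.6]],
        "page_c": [[1.0, 0.0]],
    }
    print(select_for_annotation(predictions, budget=2))
```

Ranking by ascending confidence spends the annotation budget on the pages where the current model is least certain, which is the general intuition behind confidence-based selection. Other scoring rules could be substituted for `mask_confidence` without changing the selection step.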

URL

https://arxiv.org/abs/2301.11753

PDF

https://arxiv.org/pdf/2301.11753.pdf

