Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss

2024-03-12 10:54:38
Xuhua Ren, Hengcan Shi, Jin Li

Abstract

Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while real-world applications contain many out-of-vocabulary (OOV) words. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To solve this problem, we first propose a pseudo label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds to simulate real-world applications. Second, to reduce noise in the pseudo data, we present a semantic checking mechanism that keeps only semantically meaningful data. Third, we introduce a quality-aware margin loss to improve training with pseudo data. Our loss includes a margin-based part to enhance classification ability and a quality-aware part to penalize low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state of the art on eight datasets and ranks first in the ICDAR2022 challenge.
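
The abstract does not give the form of the quality-aware margin loss, so the following is only an illustrative sketch of the general idea and not the authors' formulation. It assumes an additive margin subtracted from the target-class logit before the softmax, and a hypothetical per-sample quality score in [0, 1] that scales the loss. The function name, the margin value, and the weighting scheme are all assumptions made for illustration.

```python
import math


def margin_cross_entropy(logits, target, margin=0.3, quality=1.0):
    """Cross-entropy for one character with an additive margin (illustrative only).

    logits  -- raw class scores for a single character position
    target  -- index of the ground-truth class
    margin  -- amount subtracted from the target logit (hypothetical value)
    quality -- hypothetical per-sample quality weight in [0, 1]
    """
    # Lower the target logit so the model must win by at least `margin`.
    adjusted = [z - margin if i == target else z for i, z in enumerate(logits)]

    # Numerically stable log-sum-exp over the adjusted logits.
    peak = max(adjusted)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in adjusted))

    # Negative log-likelihood of the target, scaled by the quality weight.
    return quality * (log_sum - adjusted[target])


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0]
    print(margin_cross_entropy(scores, 0, margin=0.0))   # plain cross-entropy
    print(margin_cross_entropy(scores, 0, margin=0.3))   # larger, due to the margin
    print(margin_cross_entropy(scores, 0, quality=0.5))  # down-weighted sample
```

With margin=0.0 and quality=1.0 the function reduces to ordinary softmax cross-entropy, which makes the effect of each assumed component easy to isolate.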

URL

https://arxiv.org/abs/2403.07518

PDF

https://arxiv.org/pdf/2403.07518.pdf

