Despite the success of deep neural network (DNN) on sequential data (i.e., scene text and speech) recognition, it suffers from the over-confidence problem mainly due to overfitting in training with the cross-entropy loss, which may make the decision-making less reliable. Confidence calibration has been recently proposed as one effective solution to this problem. Nevertheless, the majority of existing confidence calibration methods aims at non-sequential data, which is limited if directly applied to sequential data since the intrinsic contextual dependency in sequences or the class-specific statistical prior is seldom exploited. To the end, we propose a Context-Aware Selective Label Smoothing (CASLS) method for calibrating sequential data. The proposed CASLS fully leverages the contextual dependency in sequences to construct confusion matrices of contextual prediction statistics over different classes. Class-specific error rates are then used to adjust the weights of smoothing strength in order to achieve adaptive calibration. Experimental results on sequence recognition tasks, including scene text recognition and speech recognition, demonstrate that our method can achieve the state-of-the-art performance.
尽管深度学习神经网络(DNN)在序列数据(如场景文本和语音识别)识别方面取得了成功,但它仍然面临着过度自信问题,这主要是因为在训练过程中过度拟合了交叉熵损失,这可能会导致决策变得不够可靠。最近,自信校准被提出了一种有效的解决方案。然而,大多数现有的自信校准方法的目标都是非序列数据,如果直接应用于序列数据,则受到限制,因为序列本身的内部上下文依赖或特定类别的统计先验很少被利用。最终,我们提出了一种Context-Aware selective Label Smoothing (CASLS)方法,用于校准序列数据。 proposed CASLS fully leverages the contextual dependency in sequences to construct confusion matrices of contextual prediction statistics over different classes。随后,特定类别的错误率被用于调整平滑强度的权重,以实现自适应校准。对序列识别任务(包括场景文本识别和语音识别)的实验结果表明,我们的方法和方法可以实现最先进的性能。
The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks via leveraging the pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision language models have made effective progresses in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detector. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of labeled data, we significantly improve the performance of the baseline method with an average of 22% in terms of the F-measure on 4 benchmarks. (3) By turning the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released.
最近的大规模Contrastive Language-Image Pretraining(CLIP)模型通过利用预训练的视觉和语言知识,在许多下游任务中表现出巨大的潜力。包含丰富文本和视觉信息的Scene文本与CLIP模型具有内在的联系。最近,基于视觉语言模型的预训练方法在文本检测领域取得了有效进展。与这些工作不同,本文提出了一种新方法,称为TCM,重点是将CLIP模型直接用于文本检测,而不需要预训练过程。我们展示TCM的优越性如下:(1) 我们框架的核心原则可以应用于改进现有的Scene文本检测方法。(2) 它促进了现有方法的少量样本训练能力,例如,通过使用标记数据中的10%,我们显著改进了基准方法的性能,平均提高22%。(3) 通过将Clip模型转化为现有的Scene文本检测方法,我们进一步实现了 promising 域适应能力。代码将公开发布。
While vision transformers have been highly successful in improving the performance in image-based tasks, not much work has been reported on applying transformers to multilingual scene text recognition due to the complexities in the visual appearance of multilingual texts. To fill the gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings that aims to flexibly explore the potential correlations between neighbouring visual patches, which is essential for feature extraction from multilingual scene texts. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive comparative studies are conducted on four widely used benchmark datasets as well as a new multilingual scene text dataset containing Indonesian, English, and Chinese collected from tourism scenes in Indonesia. Our experimental results demonstrate that TANGER is considerably better compared to the state-of-the-art, especially in handling complex multilingual scene texts.
视觉transformer在图像任务中表现优秀,但在将transformer应用于多语言场景文本识别方面,工作相对较少,因为多语言文本的视觉 appearance 非常复杂。为了填补这一差距,本文提出了一种增强的transformer架构,采用n-gram嵌入和跨语言纠正(TANger)。TANger由一个主transformer和一个自适应n-gram嵌入的附加transformer组成,旨在灵活探索相邻视觉 patch 的潜在相关性,这对于从多语言场景文本中提取特征至关重要。跨语言纠正使用一个包括语言识别和上下文连贯评分的损失函数实现。我们对四个广泛使用的基准数据集以及从印度尼西亚旅游场景收集的印度尼西亚语、英语和中文多语言场景文本数据集进行了广泛的比较研究。我们的实验结果表明,TANger相比当前最先进的技术来说表现相当好,特别是在处理复杂的多语言场景文本方面。
Scene-text spotting is a task that predicts a text area on natural scene images and recognizes its text characters simultaneously. It has attracted much attention in recent years due to its wide applications. Existing research has mainly focused on improving text region detection, not text recognition. Thus, while detection accuracy is improved, the end-to-end accuracy is insufficient. Texts in natural scene images tend to not be a random string of characters but a meaningful string of characters, a word. Therefore, we propose adversarial learning of semantic representations for scene text spotting (A3S) to improve end-to-end accuracy, including text recognition. A3S simultaneously predicts semantic features in the detected text area instead of only performing text recognition based on existing visual features. Experimental results on publicly available datasets show that the proposed method achieves better accuracy than other methods.
Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of the text images, and the resulting images will significantly affect the performance of downstream tasks. Although numerous progress has been made, existing approaches raise two crucial issues: (1) They neglect the global structure of the text, which bounds the semantic determinism of the scene text. (2) The priors, e.g., text prior or stroke prior, employed in existing works, are extracted from pre-trained text recognizers. That said, such priors suffer from the domain gap including low resolution and blurriness caused by poor imaging conditions, leading to incorrect guidance. Our work addresses these gaps and proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gain over existing approaches. Specifically, two types of prior-guided refinement modules, each using the text mask or graphic recognition result of the low-quality SR image from the preceding layer, are designed to improve the structural clarity and semantic accuracy of the text, respectively. The following attention mechanism hence modulates two quality-enhanced images to attain a superior SR result. Extensive experiments validate that our method improves the image quality and boosts the performance of downstream tasks over five typical approaches on the benchmark. Substantial visualizations and ablation studies demonstrate the advantages of the proposed DPMN. Code is available at: this https URL.
场景文本图像超分辨率(STISR)旨在同时提高文本图像的分辨率和可读性,结果图像将显著影响后续任务的表现。尽管已经取得了许多进展,但现有方法提出了两个关键问题:(1)它们忽视了文本的全局结构,这限制了场景文本的语义决定性。(2)在现有作品中使用的 priors,如文本前或肌肉前,是从训练的文本分类器中提取的。然而,这些 priors 受到领域差距的影响,包括低分辨率和模糊性由不良图像条件引起,导致错误的指导。我们的工作解决了这些差距并提出了名为“双重前缀调制网络”(DPMN)的可插拔模块,它利用双重图像前缀来提高与现有方法的性能比。具体来说,有两种类型的 prior-引导的改进模块,每个使用从前面层低质量SR图像的文本掩码或图形识别结果来改善文本的结构清晰性和语义准确性。因此,后面的注意机制因此 modulate 了两个质量增强的图像以获得更好的SR结果。广泛的实验证实,我们的方法提高了图像质量和在基准基准上提高后续任务表现胜过了五种典型的方法。有大量的可视化和烧穿研究证明了所提出的DPMN的优势。代码已可用,在此 https URL 上。
Every Scene Text Recognition (STR) task consists of text localization \& text recognition as the prominent sub-tasks. However, in real-world applications with fixed camera positions such as equipment monitor reading, image-based data entry, and printed document data extraction, the underlying data tends to be regular scene text. Hence, in these tasks, the use of generic, bulky models comes up with significant disadvantages compared to customized, efficient models in terms of model deployability, data privacy \& model reliability. Therefore, this paper introduces the underlying concepts, theory, implementation, and experiment results to develop models, which are highly specialized for the task itself, to achieve not only the SOTA performance but also to have minimal model weights, shorter inference time, and high model reliability. We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image, only using the geometrical features present, mimicking human perception over text recognition. The code is publicly available at this https URL
Text-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text. Recent studies have made encouraging progress, but they are still suffering from a lack of overall understanding of scenes and generating inaccurate captions. One possible reason is that current studies mainly focus on constructing the plane-level geometric relationship of scene text without depth information. This leads to insufficient scene text relational reasoning so that models may describe scene text inaccurately. The other possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects, leading to the scene text belonging to these ignored objects not being utilized. To address the above issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce semantic features of detected visual object concepts as auxiliary information. Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities. Sufficient experiments demonstrate the effectiveness of our proposed DEVICE, which outperforms state-of-the-art models on the TextCaps test set. Our code will be publicly available.
Understanding the scene is often essential for reading text in real-world scenarios. However, current scene text recognizers operate on cropped text images, unaware of the bigger picture. In this work, we harness the representative power of recent vision-language models, such as CLIP, to provide the crop-based recognizer with scene, image-level information. Specifically, we obtain a rich representation of the entire image and fuse it with the recognizer word-level features via cross-attention. Moreover, a gated mechanism is introduced that gradually shifts to the context-enriched representation, enabling simply fine-tuning a pretrained recognizer. We implement our model-agnostic framework, named CLIPTER - CLIP Text Recognition, on several leading text recognizers and demonstrate consistent performance gains, achieving state-of-the-art results over multiple benchmarks. Furthermore, an in-depth analysis reveals improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.
End-to-end scene text spotting has made significant progress due to its intrinsic synergy between text detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated rectangles, quadrangles,and polygons as a prerequisite, which are much more expensive than using single-point. For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost single-point annotation by the proposed framework, termed SPTS v2. SPTS v2 reserves the advantage of the auto-regressive Transformer with an Instance Assignment Decoder (IAD) through sequentially predicting the center points of all text instances inside the same predicting sequence, while with a Parallel Recognition Decoder (PRD) for text recognition in parallel. These two decoders share the same parameters and are interactively connected with a simple but effective information transmission process to pass the gradient and information. Comprehensive experiments on various existing benchmark datasets demonstrate the SPTS v2 can outperform previous state-of-the-art single-point text spotters with fewer parameters while achieving 14x faster inference speed. Most importantly, within the scope of our SPTS v2, extensive experiments further reveal an important phenomenon that single-point serves as the optimal setting for the scene text spotting compared to non-point, rectangular bounding box, and polygonal bounding box. Such an attempt provides a significant opportunity for scene text spotting applications beyond the realms of existing paradigms. Code is available at this https URL.
Most TextVQA approaches focus on the integration of objects, scene texts and question words by a simple transformer encoder. But this fails to capture the semantic relations between different modalities. The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words. It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We created a guided-attention module to capture the intra-modal interplay between the language and the vision as a guidance for inter-modal interactions. To make explicit teaching of the relations between the two modalities, we proposed and integrated two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. It is shown that our SceneGATE method outperformed existing ones because of the scene graph and its attention modules.
Recently, segmentation-based methods are quite popular in scene text detection, which mainly contain two steps: text kernel segmentation and expansion. However, the segmentation process only considers each pixel independently, and the expansion process is difficult to achieve a favorable accuracy-speed trade-off. In this paper, we propose a Context-aware and Boundary-guided Network (CBN) to tackle these problems. In CBN, a basic text detector is firstly used to predict initial segmentation results. Then, we propose a context-aware module to enhance text kernel feature representations, which considers both global and local contexts. Finally, we introduce a boundary-guided module to expand enhanced text kernels adaptively with only the pixels on the contours, which not only obtains accurate text boundaries but also keeps high speed, especially on high-resolution output maps. In particular, with a lightweight backbone, the basic detector equipped with our proposed CBN achieves state-of-the-art results on several popular benchmarks, and our proposed CBN can be plugged into several segmentation-based methods. Code will be available on this https URL.
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text. However, due to the complicated background textures and various text styles, existing methods fall short in generating clear and legible edited text images. In this study, we attribute the poor editing performance to two problems: 1) Implicit decoupling structure. Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously. 2) Domain gap. Due to the lack of edited real scene text images, the network can only be well trained on synthetic pairs and performs poorly on real-world images. To handle the above problems, we propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL). Firstly, we generate stroke guidance maps to explicitly indicate regions to be edited. Different from the implicit one by directly modifying all the pixels at image level, such explicit instructions filter out the distractions from background and guide the network to focus on editing rules of text regions. Secondly, we propose a Semi-supervised Hybrid Learning to train the network with both labeled synthetic images and unpaired real scene text images. Thus, the STE model is adapted to real-world datasets distributions. Moreover, two new datasets (Tamper-Syn2k and Tamper-Scene) are proposed to fill the blank of public evaluation datasets. Extensive experiments demonstrate that our MOSTEL outperforms previous methods both qualitatively and quantitatively. Datasets and code will be available at this https URL.
In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: this https URL.
Most existing scene text detectors require large-scale training data which cannot scale well due to two major factors: 1) scene text images often have domain-specific distributions; 2) collecting large-scale annotated scene text images is laborious. We study domain adaptive scene text detection, a largely neglected yet very meaningful task that aims for optimal transfer of labelled scene text images while handling unlabelled images in various new domains. Specifically, we design SCAST, a subcategory-aware self-training technique that mitigates the network overfitting and noisy pseudo labels in domain adaptive scene text detection effectively. SCAST consists of two novel designs. For labelled source data, it introduces pseudo subcategories for both foreground texts and background stuff which helps train more generalizable source models with multi-class detection objectives. For unlabelled target data, it mitigates the network overfitting by co-regularizing the binary and subcategory classifiers trained in the source domain. Extensive experiments show that SCAST achieves superior detection performance consistently across multiple public benchmarks, and it also generalizes well to other domain adaptive detection tasks such as vehicle detection.
This paper explores the multi-scale aggregation strategy for scene text detection in natural images. We present the Aggregated Text TRansformer(ATTR), which is designed to represent texts in scene images with a multi-scale self-attention mechanism. Starting from the image pyramid with multiple resolutions, the features are first extracted at different scales with shared weight and then fed into an encoder-decoder architecture of Transformer. The multi-scale image representations are robust and contain rich information on text contents of various sizes. The text Transformer aggregates these features to learn the interaction across different scales and improve text representation. The proposed method detects scene texts by representing each text instance as an individual binary mask, which is tolerant of curve texts and regions with dense instances. Extensive experiments on public scene text detection datasets demonstrate the effectiveness of the proposed framework.
We present a novel problem of text-based visual question generation or TextVQG in short. Given the recent growing interest of the document image analysis community in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG becomes an important task. TextVQG aims to generate a natural language question for a given input image and an automatically extracted text also known as OCR token from it such that the OCR token is an answer to the generated question. TextVQG is an essential ability for a conversational agent. However, it is challenging as it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content with the text present in the image. To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question. We refer to our proposed model as OLRA. We perform an extensive evaluation of OLRA on two public benchmarks and compare them against baselines. Our model OLRA automatically generates questions similar to the public text-based visual question answering datasets that were curated manually. Moreover, we significantly outperform baseline approaches on the performance measures popularly used in text generation literature.
In this paper, we introduce a new dataset of room interior pictures with overlaying and scene text, totalling to 4836 annotated images in 25 product categories. We provide details on the collection and annotation process of our dataset, and analyze its statistics. Furthermore, we propose a baseline method for overlaying text detection, that leverages the character region-aware text detection framework to guide the classification model. We validate our approach and show its efficiency in terms of binary classification metrics, reaching the final performance of 0.95 F1 score, with false positive and false negative rates of 0.02 and 0.06 correspondingly.
End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple detection transformer baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations and thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel, solving the sub-tasks in text spotting in a unified framework. Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code will be released.
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how to effectively model the linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous suggests enforcing explicitly language modeling by decoupling the recognizer into vision model and language model and blocking gradient flow between both models. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for the language model which can effectively alleviate the impact of noise input. Finally, to polish ABINet++ in long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module which integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments especially on low-quality images. Besides, extensive experiments including in English and Chinese also prove that, a text spotter that incorporates our language modeling method can significantly improve its performance both in accuracy and speed compared with commonly used attention-based recognizers.
Image captioning aims to describe an image with a natural language sentence, allowing powerful language models to understand images. The framework of combining image captioning with language models has been successful on various vision-language tasks. However, an image contains much more information than a single sentence, leading to underspecification of which visual entities should be described in the caption sentence. For example, when performing visual questioning answering (VQA), generic image captions often miss visual details that are essential for the language model to answer correctly. To address this challenge, we propose PromptCap, a captioning model that takes a natural-language prompt to control the contents of the generated caption. The prompt contains a question that the caption should help to answer, and also supports taking auxiliary text inputs such as scene text within the image itself. To finetune a general image caption model for prompt-guided captioning, we propose a pipeline to synthesize and filter training examples with GPT-3 and existing VQA datasets. For evaluation, we start with an existing pipeline in which a language model is prompted with image captions to carry out VQA. With the same language model, a higher QA accuracy shows that our generated captions are more relevant to the question prompts. PromptCap outperforms generic captions by a large margin on a variety of VQA tasks and achieves the state-of-the-art accuracy of 58.8 % on OK-VQA and 58.0 % on A-OKVQA. Zero-shot experiments on WebQA show that PromptCap generalizes well to unseen domains.