The advancement of text shape representations towards compactness has enhanced text detection and spotting performance, but at a high annotation cost. Current models use single-point annotations to reduce costs, yet they lack sufficient localization information for downstream applications. To overcome this limitation, we introduce Point2Polygon, which can efficiently transform single points into compact polygons. Our method uses a coarse-to-fine process, starting with creating and selecting anchor points based on recognition confidence, then vertically and horizontally refining the polygon using recognition information to optimize its shape. We demonstrate the accuracy of the generated polygons through extensive experiments: 1) By creating polygons from ground truth points, we achieved an accuracy of 82.0% on ICDAR 2015; 2) In training detectors with polygons generated by our method, we attained 86% of the accuracy relative to training with ground truth (GT); 3) Additionally, the proposed Point2Polygon can be seamlessly integrated to empower single-point spotters to generate polygons. This integration led to an impressive 82.5% accuracy for the generated polygons. It is worth mentioning that our method relies solely on synthetic recognition information, eliminating the need for any manual annotation beyond single points.
https://arxiv.org/abs/2312.13778
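As a rough illustration of the coarse-to-fine idea described above, the sketch below grows a box around a single labeled point as long as a recognition-confidence score keeps improving, then returns a four-vertex polygon. The `recog_confidence` scorer is a dummy stand-in for a recognizer, and the greedy side-by-side expansion is an assumption, not the authors' actual algorithm.

```python
# Illustrative point-to-polygon expansion; NOT the Point2Polygon implementation.
import numpy as np

def recog_confidence(image, box):
    """Dummy scorer: fraction of the box that lies inside the image."""
    x0, y0, x1, y1 = box
    h, w = image.shape[:2]
    inside = max(0, min(x1, w) - max(x0, 0)) * max(0, min(y1, h) - max(y0, 0))
    return inside / max(1, (x1 - x0) * (y1 - y0))

def point_to_polygon(image, point, step=4, max_iter=50, tol=1e-3):
    """Grow a box around a single labeled point while confidence keeps improving."""
    cx, cy = point
    box = [cx - step, cy - step, cx + step, cy + step]
    best = recog_confidence(image, box)
    for _ in range(max_iter):
        improved = False
        # Try expanding each side (left, top, right, bottom) in turn.
        for i, delta in enumerate([-step, -step, step, step]):
            cand = list(box)
            cand[i] += delta
            score = recog_confidence(image, cand)
            if score > best + tol:
                box, best = cand, score
                improved = True
        if not improved:
            break
    x0, y0, x1, y1 = box
    # Return a 4-vertex polygon (axis-aligned here; the real method refines vertices).
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)], best

polygon, conf = point_to_polygon(np.zeros((64, 256, 3)), point=(128, 32))
print(polygon, conf)
```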
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round, even drawing on external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge the large-scale multimodal model (e.g., GIT) and the language models (e.g., GPT-3). GIT can describe the image content even with scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter the low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene texts, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Last, we propose a strong baseline by adapting the GIT model for the visual dialogue task and fine-tuning it on InfoVisDial. Hopefully, our work can motivate more effort in this direction.
https://arxiv.org/abs/2312.13503
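A minimal sketch of the caption-then-dialogue pipeline described above. The `caption_image` and `complete` functions are stubs standing in for a GIT-style captioner and a GPT-3-style completion API; the prompt wording is an illustrative assumption, not the paper's actual prompt.

```python
# Caption -> LLM dialogue generation, with both models stubbed out.
def caption_image(image_path: str) -> str:
    """Stand-in for a GIT-style captioner that can read scene text."""
    return "a storefront with a sign that reads 'Joe's Coffee', opened in 1998"

def complete(prompt: str) -> str:
    """Stand-in for a GPT-3-style completion API."""
    return 'Q: What does the sign say?\nA: It reads "Joe\'s Coffee", a cafe that opened in 1998.'

def generate_dialogue(image_path: str, num_rounds: int = 3) -> str:
    caption = caption_image(image_path)
    prompt = (
        "Image description: " + caption + "\n"
        f"Write {num_rounds} rounds of an informative Q&A dialogue about this image. "
        "Answers should be long, free-form, and may bring in external knowledge.\n"
    )
    return complete(prompt)

print(generate_dialogue("shop.jpg"))
```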
Recently, diffusion-based image generation methods have been credited for their remarkable text-to-image generation capabilities, yet they still face challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus unlocking the potential multilingual generation ability of the pre-trained Stable Diffusion. Based on the observation that the cross-attention map influences object placement in generated images, we introduce a localized attention constraint into the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
https://arxiv.org/abs/2312.12232
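The sketch below illustrates one way a localized attention constraint could be imposed: cross-attention scores from image positions outside a target text region to the prompt's text tokens are suppressed before the softmax. Tensor shapes, the mask construction, and the bias value are assumptions, not Diff-Text's actual implementation.

```python
# Localized cross-attention: text tokens are down-weighted outside the target region.
import torch

def localized_cross_attention(q, k, v, region_mask, text_token_ids):
    """
    q: (B, N_img, d) image queries; k, v: (B, N_txt, d) prompt keys/values.
    region_mask: (B, N_img) bool, True where scene text should appear.
    text_token_ids: indices of prompt tokens carrying the rendered text.
    """
    d = q.size(-1)
    scores = torch.einsum("bnd,bmd->bnm", q, k) / d ** 0.5       # (B, N_img, N_txt)
    bias = torch.zeros_like(scores)
    outside = ~region_mask                                       # (B, N_img)
    # Outside the target region, suppress attention to the text tokens.
    bias[:, :, text_token_ids] = outside.unsqueeze(-1).float() * -1e4
    attn = torch.softmax(scores + bias, dim=-1)
    return torch.einsum("bnm,bmd->bnd", attn, v)

B, N_img, N_txt, d = 1, 64, 8, 32
region_mask = torch.zeros(B, N_img, dtype=torch.bool)
region_mask[:, :16] = True                                       # text region: first 16 positions
out = localized_cross_attention(
    torch.randn(B, N_img, d), torch.randn(B, N_txt, d), torch.randn(B, N_txt, d),
    region_mask, text_token_ids=torch.tensor([2, 3]),
)
print(out.shape)  # torch.Size([1, 64, 32])
```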
Scene text recognition has attracted increasing attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains inference speed. Conversely, non-autoregressive models provide faster, simultaneous predictions but often sacrifice accuracy. Although utilizing an explicit language model can improve performance, it adds to the computational load. Moreover, separating linguistic knowledge from vision information may harm the final prediction. In this paper, we propose an alternative solution, using a parallel and iterative decoder that adopts an easy-first decoding strategy. Furthermore, we regard text recognition as an image-based conditional text generation task and utilize the discrete diffusion strategy, ensuring exhaustive exploration of bidirectional contextual information. Extensive experiments demonstrate that the proposed approach achieves superior results on benchmark datasets, including both Chinese and English text images.
https://arxiv.org/abs/2312.11923
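A toy sketch of parallel, iterative easy-first decoding: every position is predicted simultaneously in each round and only the most confident still-masked positions are committed. The predictor is a random stand-in, and the schedule (three commits per step) is an arbitrary illustrative choice.

```python
import torch

MASK_ID = -1  # toy mask id; real models use a dedicated [MASK] token

def predict_logits(tokens):
    """Random stand-in for the parallel decoder's per-position predictions."""
    return torch.randn(tokens.size(0), tokens.size(1), 100)

def easy_first_decode(length=12, commits_per_step=3, steps=4):
    tokens = torch.full((1, length), MASK_ID)
    for _ in range(steps):
        probs = predict_logits(tokens).softmax(-1)
        conf, pred = probs.max(-1)                        # (1, L) confidence and argmax
        conf = conf.masked_fill(tokens != MASK_ID, -1.0)  # only still-masked slots compete
        idx = conf.topk(commits_per_step, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))      # commit the "easiest" positions
        if (tokens != MASK_ID).all():
            break
    return tokens

print(easy_first_decode())
```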
Natural scene text detection is a significant challenge in computer vision, with tremendous potential applications in multilingual, diverse, and complex text scenarios. We propose a multilingual text detection model to address the issues of low accuracy and high difficulty in detecting multilingual text in natural scenes. In response to the challenges posed by multilingual text images with multiple character sets and various font styles, we introduce the SFM Swin Transformer feature extraction network to enhance the model's robustness in detecting characters and fonts across different languages. To handle the considerable variation in text scales and the complex arrangements found in natural scene text images, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module. These feature fusion improvements enhance the model's ability to detect texts of varying sizes and orientations. Diverse backgrounds and font variations in multilingual scene text images remain challenging for existing methods, whose limited local receptive fields hinder detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features for more effective text detection, meeting the need for comprehensive information. In this study, we collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses. The experimental results demonstrate that the proposed algorithm achieves an F-measure of 85.02\%, which is 4.71\% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset can be found at this https URL.
https://arxiv.org/abs/2312.11153
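A generic adaptive spatial feature fusion block, given as a simplification of the AS-HRFPN idea above: pyramid features are resized to one resolution and mixed with learned per-pixel softmax weights. Channel sizes and the 1x1-convolution weighting are assumptions, not the paper's exact design.

```python
# Generic adaptive spatial feature fusion across pyramid levels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatialFusion(nn.Module):
    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        self.weight_conv = nn.Conv2d(channels * num_levels, num_levels, kernel_size=1)

    def forward(self, feats):
        # feats: list of (B, C, H_i, W_i); fuse at the resolution of feats[0].
        h, w = feats[0].shape[-2:]
        resized = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                   for f in feats]
        weights = self.weight_conv(torch.cat(resized, dim=1)).softmax(dim=1)  # (B, L, H, W)
        stacked = torch.stack(resized, dim=1)                                  # (B, L, C, H, W)
        return (weights.unsqueeze(2) * stacked).sum(dim=1)                     # (B, C, H, W)

fuse = AdaptiveSpatialFusion(channels=64)
out = fuse([torch.randn(1, 64, 80, 80), torch.randn(1, 64, 40, 40), torch.randn(1, 64, 20, 20)])
print(out.shape)  # torch.Size([1, 64, 80, 80])
```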
In this paper, we investigate cross-lingual learning (CLL) for multilingual scene text recognition (STR). CLL transfers knowledge from one language to another. We aim to find the conditions under which knowledge from high-resource languages can be exploited to improve performance in low-resource languages. To do so, we first examine whether two general insights about CLL discussed in previous works apply to multilingual STR: (1) joint learning with high- and low-resource languages may reduce performance on low-resource languages, and (2) CLL works best between typologically similar languages. Through extensive experiments, we show that these two general insights may not apply to multilingual STR. We then show that the crucial condition for CLL is the dataset size of the high-resource languages, regardless of which high-resource languages are used. Our code, data, and models are available at this https URL.
https://arxiv.org/abs/2312.10806
Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. Differently, in this work, we propose \textbf{FreeReal}, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a novel glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 4.56\%, 3.85\%, 3.90\%, and 1.97\% in improving the performance of DBNet, PANet, PSENet, and FCENet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code will be released soon.
https://arxiv.org/abs/2312.05286
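A much-simplified GlyphMix-style mix, assuming bright glyph strokes on a dark synthetic background: thresholded glyph pixels are pasted onto a real image and the synthetic word box becomes a free annotation. The brightness threshold stands in for the paper's glyph-structure delineation.

```python
# Paste thresholded synthetic glyph pixels onto a real image, keeping the synthetic label.
import numpy as np

def glyph_mix(real_img, synth_crop, top_left, thresh=128):
    """Paste the glyph pixels of `synth_crop` onto `real_img` at `top_left` (y, x)."""
    out = real_img.copy()
    y, x = top_left
    h, w = synth_crop.shape[:2]
    glyph_mask = synth_crop.mean(axis=-1) > thresh           # bright strokes on dark background
    region = out[y:y + h, x:x + w]
    region[glyph_mask] = synth_crop[glyph_mask]              # writes through to `out`
    # The pasted word inherits its synthetic label: an axis-aligned box here.
    box = (x, y, x + w, y + h)
    return out, box

real = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
synth = np.zeros((32, 96, 3), dtype=np.uint8)
synth[8:24, 8:88] = 255                                      # fake glyph strokes
mixed, box = glyph_mix(real, synth, top_left=(100, 60))
print(mixed.shape, box)
```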
Text-to-Image (T2I) generation methods based on diffusion models have garnered significant attention in the last few years. Although these image synthesis methods produce visually appealing results, they frequently exhibit spelling errors when rendering text within the generated images. Such errors manifest as missing, incorrect or extraneous characters, thereby severely constraining the performance of text image generation based on diffusion models. To address the aforementioned issue, this paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion [27]). Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder and provides more robust text embeddings as conditional guidance. Then, we fine-tune the diffusion model using a large-scale dataset, incorporating local attention control under the supervision of character-level segmentation maps. Finally, by employing an inference-stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art. Furthermore, we showcase several potential applications of the proposed UDiffText, including text-centric image synthesis, scene text editing, etc. Code and model will be available at this https URL.
https://arxiv.org/abs/2312.04884
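A sketch of what a light-weight character-level text encoder could look like: characters are embedded and contextualized by a small Transformer, producing per-character embeddings that could condition a diffusion U-Net. The charset, dimensions, and depth are illustrative assumptions, not UDiffText's actual encoder.

```python
# Character-level text encoder producing per-character conditioning embeddings.
import torch
import torch.nn as nn

class CharTextEncoder(nn.Module):
    def __init__(self, charset="abcdefghijklmnopqrstuvwxyz0123456789 ", dim=256, max_len=24):
        super().__init__()
        self.stoi = {c: i + 1 for i, c in enumerate(charset)}   # 0 is padding
        self.embed = nn.Embedding(len(charset) + 1, dim, padding_idx=0)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))   # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.max_len = max_len

    def forward(self, texts):
        ids = torch.zeros(len(texts), self.max_len, dtype=torch.long)
        for b, t in enumerate(texts):
            for i, c in enumerate(t.lower()[: self.max_len]):
                ids[b, i] = self.stoi.get(c, 0)
        x = self.embed(ids) + self.pos
        return self.encoder(x, src_key_padding_mask=(ids == 0))  # (B, max_len, dim)

enc = CharTextEncoder()
print(enc(["Hello world", "UDiffText"]).shape)  # torch.Size([2, 24, 256])
```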
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, semantic information and visual structure, affect the recognition performance significantly. To mitigate the effects of these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, a diffusion-based module is developed to enhance the text prior, hence offering better guidance for the SR network to generate SR images with higher semantic accuracy. Meanwhile, the proposed PEAN leverages an attention-based modulation module to understand scene text images by perceiving the local and global dependencies of images, regardless of the shape of the text. A multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new SOTA results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code will be made available at this https URL.
https://arxiv.org/abs/2311.17955
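A generic multi-task objective in the spirit of the paradigm above: a pixel reconstruction loss on the SR image plus a recognition loss on the recognizer's predictions for that image. The loss choices and the 0.1 weight are assumptions, not PEAN's actual configuration.

```python
# Multi-task SR objective: pixel reconstruction + recognition supervision.
import torch
import torch.nn.functional as F

def multitask_loss(sr_img, hr_img, rec_logits, gt_char_ids, rec_weight=0.1):
    pixel_loss = F.l1_loss(sr_img, hr_img)
    rec_loss = F.cross_entropy(rec_logits.flatten(0, 1), gt_char_ids.flatten())
    return pixel_loss + rec_weight * rec_loss

loss = multitask_loss(
    sr_img=torch.rand(2, 3, 32, 128), hr_img=torch.rand(2, 3, 32, 128),
    rec_logits=torch.randn(2, 25, 97), gt_char_ids=torch.randint(0, 97, (2, 25)),
)
print(loss.item())
```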
Robustness certification, which aims to formally certify the predictions of neural networks against adversarial inputs, has become an important tool for safety-critical applications. Despite considerable progress, existing certification methods are limited to elementary architectures, such as convolutional networks, recurrent networks and, recently, Transformers, on benchmark datasets such as MNIST. In this paper, we focus on the robustness certification of scene text recognition (STR), which is a complex and extensively deployed image-based sequence prediction problem. We tackle three types of STR model architectures, including the standard STR pipelines and the Vision Transformer. We propose STR-Cert, the first certification method for STR models, by significantly extending the DeepPoly polyhedral verification framework, deriving novel polyhedral bounds and algorithms for key STR model components. Finally, we certify and compare STR models on six datasets, demonstrating the efficiency and scalability of robustness certification, particularly for the Vision Transformer.
https://arxiv.org/abs/2401.05338
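To illustrate how certified bounds are propagated, the sketch below uses plain interval (box) bound propagation through a linear layer and a ReLU. This is a coarser relaxation than the DeepPoly-style polyhedral bounds that STR-Cert extends, but the overall flow is the same: a perturbation budget goes in, certified output ranges come out.

```python
# Interval bound propagation through linear + ReLU layers (simplified certification).
import numpy as np

def linear_bounds(lo, hi, W, b):
    """Propagate elementwise input bounds [lo, hi] through y = W x + b."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    y_lo = W_pos @ lo + W_neg @ hi + b
    y_hi = W_pos @ hi + W_neg @ lo + b
    return y_lo, y_hi

def relu_bounds(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
eps = 0.01                                   # L-infinity perturbation budget
lo, hi = x - eps, x + eps
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
lo, hi = relu_bounds(*linear_bounds(lo, hi, W1, b1))
lo, hi = linear_bounds(lo, hi, W2, b2)
# The prediction is certified if the lower bound of the true class exceeds the
# upper bounds of all other classes.
print("output lower bounds:", lo)
print("output upper bounds:", hi)
```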
Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.
https://arxiv.org/abs/2311.16555
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images, consequently elevating recognition accuracy in Scene Text Recognition (STR). Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance to address this issue. Nevertheless, they remain deficient when confronted with severely blurred images, due to their insufficient generation capability when little structural or semantic information can be extracted from original images. Therefore, we introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising Network, to guide the diffusion model generating LR-consistent results through succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the superiority of RGDiffSR over prior state-of-the-art methods in both text recognition accuracy and image fidelity.
https://arxiv.org/abs/2311.13317
Retrieving textual information from natural scene images is an active research area in the field of computer vision with numerous practical applications. Detecting text regions and extracting text from signboards is a challenging problem due to special characteristics like reflecting lights, uneven illumination, or shadows found in real-life natural scene images. With the advent of deep learning-based methods, different sophisticated techniques have been proposed for text detection and text recognition from natural scenes. Though a significant amount of effort has been devoted to extracting natural scene text for resource-rich languages like English, little has been done for low-resource languages like Bangla. In this research work, we have proposed an end-to-end system with deep learning-based models for efficiently detecting, recognizing, correcting, and parsing address information from Bangla signboards. We have created manually annotated datasets and synthetic datasets to train signboard detection, address text detection, address text recognition, address text correction, and address text parser models. We have conducted a comparative study of different CTC-based and encoder-decoder model architectures for Bangla address text recognition. Moreover, we have designed a novel address text correction model using a sequence-to-sequence transformer-based network to improve the performance of the Bangla address text recognition model by post-correction. Finally, we have developed a Bangla address text parser using a state-of-the-art transformer-based pre-trained language model.
https://arxiv.org/abs/2311.13222
Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E$^2$STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E$^2$STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks.
https://arxiv.org/abs/2311.13120
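A sketch of how a context-rich sequence might be assembled for in-context training or inference: visual and text tokens from a few demonstration samples are concatenated ahead of the query's visual tokens. The special-token layout is an assumption, not E$^2$STR's actual format.

```python
# Assemble an in-context sequence from demonstration (image, text) token pairs.
import torch

BOS, SEP = 1, 2

def build_icl_sequence(demo_pairs, query_visual):
    """demo_pairs: list of (visual_tokens, text_tokens); query_visual: 1-D tensor."""
    pieces = [torch.tensor([BOS])]
    for visual, text in demo_pairs:
        pieces += [visual, torch.tensor([SEP]), text, torch.tensor([SEP])]
    pieces += [query_visual, torch.tensor([SEP])]    # the decoder continues from here
    return torch.cat(pieces)

demos = [(torch.randint(10, 100, (16,)), torch.randint(10, 100, (8,))) for _ in range(2)]
seq = build_icl_sequence(demos, query_visual=torch.randint(10, 100, (16,)))
print(seq.shape)  # torch.Size([70])
```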
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred and noisy low-resolution (LR) text images in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we leverage text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, for STISR tasks. Our experimental results revealed that text-conditional DMs notably surpass existing STISR methods. Especially when texts from LR text images are given as input, the text-conditional DMs are able to produce superior quality super-resolution text images. Utilizing this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. This framework consists of three specialized text-conditional DMs, each dedicated to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images, which are more suitable for training STISR methods. Our experiments confirmed that these synthesized image pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.
https://arxiv.org/abs/2311.09759
Inspired by the success of transfer learning in computer vision, roboticists have investigated visual pre-training as a means to improve the learning efficiency and generalization ability of policies learned from pixels. To that end, past work has favored large object interaction datasets, such as first-person videos of humans completing diverse tasks, in pursuit of manipulation-relevant features. Although this approach improves the efficiency of policy learning, it remains unclear how reliable these representations are in the presence of distribution shifts that arise commonly in robotic applications. Surprisingly, we find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture or the introduction of distractor objects. To understand what properties do lead to robust representations, we compare the performance of 15 pre-trained vision models under different visual appearances. We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models. The rank order induced by this metric is more predictive than metrics that have previously guided generalization research within computer vision and machine learning, such as downstream ImageNet accuracy, in-domain accuracy, or shape-bias as evaluated by cue-conflict performance. We test this finding extensively on a suite of distribution shifts in ten tasks across two simulated manipulation environments. On the ALOHA setup, segmentation score predicts real-world performance after offline training with 50 demonstrations.
https://arxiv.org/abs/2312.12444
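The kind of analysis described above can be summarized as a rank correlation between a per-model property and its out-of-distribution success rate. The sketch below uses SciPy's Spearman correlation on fabricated placeholder numbers, not results from the paper.

```python
# Rank-correlate a per-model segmentation score with OOD success rate.
import numpy as np
from scipy.stats import spearmanr

segmentation_score = np.array([0.42, 0.55, 0.31, 0.68, 0.49])   # one entry per ViT model (placeholder)
ood_success_rate   = np.array([0.35, 0.52, 0.28, 0.61, 0.44])   # same models under shifts (placeholder)

rho, pval = spearmanr(segmentation_score, ood_success_rate)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3f})")
```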
Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, one might notice that generated images from Stable Diffusion models suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation, e.g., scene text editing. As a desired result, the model must show the capability to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce the Diffusion-BasEd Scene Text manipulation network, DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against the state of the art on various scene text datasets, then provide extensive ablation studies for each granularity to analyze our performance gain. Also, we demonstrate the effectiveness of our proposed method in synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% on the COCO-Text and ICDAR2013 datasets for character-level evaluation.
https://arxiv.org/abs/2311.00734
This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks. Based on these observations, we delve deeper into the necessity of specialized OCR models and deliberate on the strategies to fully harness pretrained general LMMs like GPT-4V for downstream OCR tasks. The study offers a critical reference for future research in OCR with LMMs. The evaluation pipeline and results are available at this https URL.
https://arxiv.org/abs/2310.16809
Scene Text Editing (STE) aims to substitute text in an image with new desired text while preserving the background and styles of the original text. However, existing techniques struggle to generate edited text images that exhibit a high degree of clarity and legibility. This challenge primarily stems from the inherent diversity of text types and the intricate textures of complex backgrounds. To address this challenge, this paper introduces a three-stage framework for transferring texts across text images. Initially, we introduce a text-swapping network that seamlessly substitutes the original text with the desired replacement. Subsequently, we incorporate a background inpainting network into our framework. This specialized network is designed to skillfully reconstruct background images, effectively filling the voids left after the removal of the original text while preserving visual harmony and coherence in the background. Ultimately, the outputs of the text-swapping network and the background inpainting network are combined through a fusion network to produce the final edited image. A demo video is included in the supplementary material.
https://arxiv.org/abs/2310.13366
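A toy composition of the three-stage idea above: the text-swapped foreground is blended onto the inpainted background using a soft text mask. The actual fusion is learned by a network; plain alpha compositing is used here purely for illustration.

```python
# Alpha-composite the text-swapped foreground onto the inpainted background.
import numpy as np

def fuse(swapped_text_img, inpainted_bg, text_mask):
    """text_mask in [0, 1], 1 where the new text strokes are."""
    alpha = text_mask[..., None]                        # (H, W, 1) for broadcasting
    return (alpha * swapped_text_img + (1 - alpha) * inpainted_bg).astype(np.uint8)

h, w = 64, 256
swapped = np.full((h, w, 3), 255, dtype=np.uint8)        # rendered target text
background = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
mask = np.zeros((h, w), dtype=np.float32)
mask[20:44, 40:200] = 1.0                                # where the new text sits
print(fuse(swapped, background, mask).shape)             # (64, 256, 3)
```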
Diffusion-based methods have achieved prominent success in generating 2D media. However, accomplishing similar proficiency for scene-level mesh texturing in 3D spatial applications, e.g., XR/VR, remains constrained, primarily due to the intricate nature of 3D geometry and the necessity for immersive free-viewpoint rendering. In this paper, we propose a novel indoor scene texturing framework, which delivers text-driven texture generation with enchanting details and authentic spatial coherence. The key insight is to first imagine a stylized 360° panoramic texture from the central viewpoint of the scene, and then propagate it to the rest of the scene with inpainting and imitating techniques. To ensure meaningful and aligned textures for the scene, we develop a novel coarse-to-fine panoramic texture generation approach with dual texture alignment, which considers both the geometry and texture cues of the captured scenes. To cope with cluttered geometries during texture propagation, we design a separated strategy that conducts texture inpainting in confident regions and then learns an implicit imitating network to synthesize textures in occluded and tiny structural areas. Extensive experiments and an immersive VR application on real-world indoor scenes demonstrate the high quality of the generated textures and the engaging experience on VR headsets. Project webpage: this https URL
https://arxiv.org/abs/2310.13119