Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of complete text styles, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, make the texts difficult to understand, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and struggle to restore accurate text images with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets that contain scene text images and handwritten text images, respectively. Each includes images derived from real-life and synthetic sources, featuring pairs of original images, corrupted images, and other auxiliary information. On top of the datasets, we further develop a novel neural framework, the Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM uses an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by a thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to advance the broader field of text image understanding and processing. Code and datasets are available at: this https URL.
https://arxiv.org/abs/2401.14832
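To make the structure-as-prior idea concrete, here is a minimal PyTorch sketch under stated assumptions: a segmentation-style module predicts a glyph-structure map from the corrupted image, and the diffusion denoiser receives that map and the corrupted image as extra conditioning channels. The class names, channel widths, and the omission of timestep embedding are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StructurePredictor(nn.Module):
    """Hypothetical stage 1: predict a glyph-structure map from the corrupted image."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, corrupted):          # (B, 3, H, W) -> (B, 1, H, W)
        return self.net(corrupted)

class StructureGuidedDenoiser(nn.Module):
    """Hypothetical stage 2: a denoiser whose conditioning channels are the
    corrupted image and the predicted structure map (timestep embedding and
    U-Net details omitted for brevity)."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x_t, corrupted, structure):
        # Predicts noise given the noisy image x_t plus the two conditions.
        return self.net(torch.cat([x_t, corrupted, structure], dim=1))
```

In a full diffusion model the denoiser would also take a timestep embedding and be trained with the usual noise-prediction objective.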
Recently, scene text detection has received significant attention due to its wide application. However, accurate detection of text at multiple scales, orientations, and curvatures in complex scenes remains a challenge. Numerous detection methods adopt the Vatti clipping (VC) algorithm for multiple-instance training to address the issue of arbitrary-shaped text. Yet we identify a bias arising from these approaches, called the "shrinked kernel": a decrease in accuracy resulting from outputs that overly favor the text kernel. In this paper, we propose a new approach named Expand Kernel Network (EK-Net), which uses an expand kernel distance to compensate for this deficiency, together with a three-stage regression to complete instance detection. Moreover, EK-Net not only realizes precise positioning of arbitrary-shaped text, but also achieves a trade-off between performance and speed. Evaluation results demonstrate that EK-Net achieves state-of-the-art or competitive performance compared to other advanced methods, e.g., an F-measure of 85.72% at 35.42 FPS on ICDAR 2015 and an F-measure of 85.75% at 40.13 FPS on CTW1500.
https://arxiv.org/abs/2401.11704
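The Vatti clipping step that such kernel-based detectors rely on is typically implemented with the `pyclipper` library. Below is a minimal sketch of shrinking (or expanding) a text polygon by an offset distance; the formula D = Area * (1 - r^2) / Perimeter is the common DBNet-style choice, used here only as an example — EK-Net's expand kernel distance is defined differently in the paper.

```python
import numpy as np
import pyclipper

def offset_polygon(poly, ratio=0.4, expand=False):
    """Shrink (kernel generation) or expand a text polygon via Vatti clipping.
    `poly` is an (N, 2) array of vertices; returns a list of offset polygons."""
    pts = np.asarray(poly, dtype=np.float64)
    # Shoelace area and perimeter of the polygon.
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    perimeter = np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1).sum()
    d = area * (1.0 - ratio ** 2) / max(perimeter, 1e-6)  # offset distance
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(pts.round().astype(int).tolist(), pyclipper.JT_ROUND,
                pyclipper.ET_CLOSEDPOLYGON)
    return pco.Execute(d if expand else -d)  # negative delta shrinks
```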
Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes. Although current state-of-the-art models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by multiple self-attention layers, while eschewing the traditional sequence decoder. This design choice results in a lightweight and efficient model capable of handling inputs of varying sizes. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of VIPTR. Notably, the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the VIPTR-L (Large) variant attains greater recognition accuracy while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, blending high accuracy with efficiency and greatly benefiting real-world applications that require fast and reliable text recognition. The code is publicly available at this https URL.
https://arxiv.org/abs/2401.10110
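As a rough illustration of what "eschewing the sequence decoder" means, here is a minimal decoder-free recognizer: a convolutional stem collapses the height dimension and a self-attention encoder emits per-frame character logits (e.g., for CTC training). This is a generic stand-in with assumed dimensions, not VIPTR's pyramid of permutable attention stages.

```python
import torch
import torch.nn as nn

class DecoderFreeRecognizer(nn.Module):
    """Sketch: visual extractor directly emits per-frame logits, no decoder."""
    def __init__(self, num_classes, d_model=192, nhead=6, depth=3):
        super().__init__()
        self.stem = nn.Sequential(                   # (B,3,32,W) -> (B,d,1,W/4)
            nn.Conv2d(3, d_model // 2, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model // 2, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))         # collapse the height axis
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, num_classes)  # CTC-style per-frame logits

    def forward(self, images):
        x = self.stem(images).squeeze(2).transpose(1, 2)  # (B, W/4, d)
        return self.head(self.encoder(x))                 # (B, W/4, classes)
```

Because every frame is classified in parallel, inference cost is one encoder pass, which is the source of the speed advantage the abstract describes.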
Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored in the process of semantic mining, which limits the performance of these algorithms in recognizing irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch carries out visual recognition based on the visual features extracted by a CNN and the position encoding information provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way that humans recognize scene text and integrates cross-modal visual cues for text recognition. The experiments demonstrate that the proposed CMFN algorithm achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
https://arxiv.org/abs/2401.10041
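One plausible reading of the cross-modal fusion gate is a learned sigmoid gate that blends visual and semantic character features; the sketch below assumes both branches output aligned (B, T, dim) sequences, which is an assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossModalFusionGate(nn.Module):
    """Sketch of a gated blend between visual and semantic character features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, semantic):          # both (B, T, dim)
        g = torch.sigmoid(self.gate(torch.cat([visual, semantic], dim=-1)))
        return g * visual + (1.0 - g) * semantic  # per-feature convex blend
```

A gate like this lets the model lean on visual evidence where the language branch is uncertain, and vice versa, which matches the abstract's motivation of keeping visual cues in the semantic mining loop.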
Segmentation-based scene text detection algorithms can handle arbitrary-shaped scene text and have strong robustness and adaptability, so they have attracted wide attention. Existing segmentation-based scene text detection algorithms usually only segment the pixels in the center region of the text, while ignoring other information of the text region, such as edge information and distance information, thus limiting their detection accuracy for scene text. This paper proposes a plug-and-play module called the Region Multiple Information Perception Module (RMIPM) to enhance the detection performance of segmentation-based algorithms. Specifically, we design a module that can perceive various types of information about scene text regions, such as text foreground classification maps, distance maps, and direction maps. Experiments on the MSRA-TD500 and TotalText datasets show that our method achieves performance comparable to current state-of-the-art algorithms.
https://arxiv.org/abs/2401.10017
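A minimal sketch of what "perceiving multiple kinds of region information" could look like: parallel convolutional heads over a shared feature map, one per map type. The channel counts and head depth are assumptions, not the paper's exact RMIPM design.

```python
import torch.nn as nn

class RegionMultiInfoHeads(nn.Module):
    """Parallel heads over a shared feature map, one per kind of region info."""
    def __init__(self, in_ch=256):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch // 4, 3, padding=1), nn.ReLU(),
                nn.Conv2d(in_ch // 4, out_ch, 1))
        self.fg = head(1)          # text foreground classification map
        self.dist = head(1)        # distance-to-boundary map
        self.direction = head(2)   # direction (unit-vector) map

    def forward(self, feats):      # feats: (B, in_ch, H, W)
        return {"fg": self.fg(feats),
                "dist": self.dist(feats),
                "dir": self.direction(feats)}
```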
Arbitrary-shaped scene text detection is of great importance in scene understanding tasks. Due to the complexity and diversity of text in natural scenes, existing algorithms have limited accuracy in detecting arbitrary-shaped text. In this paper, we propose a novel arbitrary-shaped scene text detector based on boundary point dynamic optimization (BPDO). The proposed model consists of a text-aware module (TAM) and a boundary point dynamic optimization module (DOM). Specifically, the text-aware module is based on segmentation and obtains boundary points describing the central region of the text by extracting a priori information about the text region. Then, based on the idea of deformable attention, the model applies dynamic optimization to the boundary points, gradually refining the exact position of each boundary point based on information from its adjacent region. Experiments on the CTW-1500, Total-Text, and MSRA-TD500 datasets show that the proposed model achieves performance better than or comparable to state-of-the-art algorithms, proving its effectiveness.
https://arxiv.org/abs/2401.09997
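A simplified sketch of iterative boundary-point optimization: sample the feature map at each point, regress a 2-D offset, and repeat for a few steps. The actual DOM uses deformable attention over each point's neighborhood; this stand-in uses plain bilinear sampling via `grid_sample`, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryRefiner(nn.Module):
    """Sketch: iteratively move boundary points using locally sampled features."""
    def __init__(self, feat_ch=256, steps=3):
        super().__init__()
        self.offset_head = nn.Sequential(nn.Linear(feat_ch, 64), nn.ReLU(),
                                         nn.Linear(64, 2))
        self.steps = steps

    def forward(self, feats, points):
        # feats: (B, C, H, W); points: (B, P, 2) in normalized [-1, 1] coords,
        # the (x, y) convention grid_sample expects.
        for _ in range(self.steps):
            grid = points.unsqueeze(1)                    # (B, 1, P, 2)
            sampled = F.grid_sample(feats, grid, align_corners=False)
            sampled = sampled.squeeze(2).transpose(1, 2)  # (B, P, C)
            points = points + self.offset_head(sampled)   # nudge each point
        return points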
End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition. Specifically, we enhance the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through recognition loss, while Recognition Alignment dynamically extracts text features for recognition through the detection predictions. This simple yet effective design results in a concise framework that requires neither an additional rectification module nor character-level annotations for arbitrarily-shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at \href{this https URL}{SwinTextSpotter v2}.
https://arxiv.org/abs/2401.07641
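A minimal sketch of the Recognition Alignment idea as the abstract describes it: soft detection masks gate the shared features handed to the recognizer, so the recognition loss can back-propagate into detection. The shapes and the simple multiplicative gating are assumptions, not the paper's implementation.

```python
import torch

def recognition_alignment(feats, inst_masks):
    """feats: (B, C, H, W) shared backbone features;
    inst_masks: (B, N, H, W) soft detection masks in [0, 1].
    Returns per-instance gated feature maps for the recognizer."""
    gated = feats.unsqueeze(1) * inst_masks.unsqueeze(2)  # (B, N, C, H, W)
    return gated  # recognition gradients flow back through inst_masks
```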
Scene text spotting is a challenging task, especially for inverse-like scene text, which has complex layouts, e.g., mirrored, symmetrical, or retro-flexed. In this paper, we propose a unified end-to-end trainable inverse-like antagonistic text spotting framework dubbed IATS, which can effectively spot inverse-like scene texts without sacrificing general ones. Specifically, we propose an innovative reading-order estimation module (REM) that extracts reading-order information from the initial text boundary generated by an initial boundary module (IBM). To optimize and train REM, we propose a joint reading-order estimation loss consisting of a classification loss, an orthogonality loss, and a distribution loss. With the help of IBM, we can divide the initial text boundary into two symmetric control points and iteratively refine the text boundary using a lightweight boundary refinement module (BRM) to adapt to various shapes and scales. To alleviate the incompatibility between text detection and recognition, we propose a dynamic sampling module (DSM) with a thin-plate spline that can dynamically sample appropriate features for recognition in the detected text region. Without extra supervision, the DSM can proactively learn to sample appropriate features for text recognition through the gradient returned by the recognition module. Extensive experiments on both challenging scene text and inverse-like scene text datasets demonstrate that our method achieves superior performance on both irregular and inverse-like text spotting.
https://arxiv.org/abs/2401.03637
The scaling laws relating model size, data volume, and computation to model performance have been extensively studied in the field of Natural Language Processing (NLP). However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume, and computation in the field of text recognition. Conclusively, the study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%.
https://arxiv.org/abs/2401.00028
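The "smooth power law" claim can be made concrete in a few lines of NumPy: a power law E = a * N^(-b) between error rate E and model size N is a straight line in log-log space, so it can be fitted by linear regression. The same fit applies to training data volume. The numbers below are illustrative, not the paper's measurements.

```python
import numpy as np

# Illustrative (assumed) points: model size in parameters vs. error rate.
model_sizes = np.array([5e6, 2e7, 8e7, 3e8])
error_rates = np.array([0.12, 0.08, 0.055, 0.04])   # 1 - accuracy

# log E = log a - b * log N  ->  ordinary least squares in log-log space.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(error_rates), 1)
a, b = np.exp(intercept), -slope
print(f"E(N) ~= {a:.3f} * N^(-{b:.3f})")
```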
Scene text spotting is essential in various computer vision applications, enabling the extraction and interpretation of textual information from images. However, existing methods often neglect the spatial semantics of word images, leading to suboptimal detection recall rates for long and short words within the long-tailed word length distributions that exist prominently in dense scenes. In this paper, we present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition, improving the spotting capabilities for long and short words, particularly in the tail data of dense text images. We first design an image encoder equipped with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, leveraging the Transformer framework, we synergistically optimize text detection and recognition accuracy after iteratively refining text region image features using the word length prior. Specifically, we design a Spatial Length Predictor (SLP) module that uses a character-count prior tailored to different word lengths to constrain the regions of interest effectively. Furthermore, we introduce a specialized word Length-aware Segmentation (LenSeg) proposal head, enhancing the network's capacity to capture the distinctive features of long and short words within categories characterized by long-tailed distributions. Comprehensive experiments on public datasets and our dense text spotting dataset DSTD1500 demonstrate the superiority of our proposed methods, particularly in dense text image detection and recognition tasks involving long-tailed word length distributions encompassing a range of long and short words.
https://arxiv.org/abs/2312.15690
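A minimal sketch of a Spatial-Length-Predictor-style head: pool each proposal's RoI feature and classify it into word-length bins, which can then constrain the region of interest. The bin layout and feature dimension are assumptions, not the paper's exact SLP.

```python
import torch.nn as nn

class SpatialLengthPredictor(nn.Module):
    """Sketch: predict a word-length bin per proposal from pooled RoI features."""
    def __init__(self, in_ch=256, num_bins=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.cls = nn.Linear(in_ch, num_bins)  # e.g. bins 1-2, 3-4, ..., 15+

    def forward(self, roi_feats):              # (N, C, h, w) per proposal
        x = self.pool(roi_feats).flatten(1)    # (N, C)
        return self.cls(x)                     # (N, num_bins) length logits
```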
The advancement of text shape representations towards compactness has enhanced text detection and spotting performance, but at a high annotation cost. Current models use single-point annotations to reduce costs, yet they lack sufficient localization information for downstream applications. To overcome this limitation, we introduce Point2Polygon, which can efficiently transform single points into compact polygons. Our method uses a coarse-to-fine process, starting with creating and selecting anchor points based on recognition confidence, then vertically and horizontally refining the polygon using recognition information to optimize its shape. We demonstrate the accuracy of the generated polygons through extensive experiments: 1) By creating polygons from ground truth points, we achieved an accuracy of 82.0% on ICDAR 2015; 2) In training detectors with polygons generated by our method, we attained 86% of the accuracy relative to training with ground truth (GT); 3) Additionally, the proposed Point2Polygon can be seamlessly integrated to empower single-point spotters to generate polygons. This integration led to an impressive 82.5% accuracy for the generated polygons. It is worth mentioning that our method relies solely on synthetic recognition information, eliminating the need for any manual annotation beyond single points.
https://arxiv.org/abs/2312.13778
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round, even drawing on external knowledge related to the visual content. Different from existing datasets where the answers are compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3). GIT can describe the image content, even scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene texts, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Last, we propose a strong baseline by adapting the GIT model for the visual dialogue task and fine-tuning the model on InfoVisDial. Hopefully, our work can motivate more effort in this direction.
https://arxiv.org/abs/2312.13503
Recently, diffusion-based image generation methods have been credited for their remarkable text-to-image generation capabilities, while still facing challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus unlocking the potential multilingual-generation ability of the pre-trained Stable Diffusion. Based on the observation that the cross-attention map influences object placement in generated images, we propose a localized attention constraint in the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
https://arxiv.org/abs/2312.12232
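One way a localized attention constraint could be realized, sketched below under assumptions: in the cross-attention layer, spatial queries outside the designated text region receive a large negative bias toward the glyph tokens, steering text rendering into the intended region. The exact constraint in Diff-Text may differ; the shapes and masking scheme here are illustrative.

```python
import math
import torch

def localized_cross_attention(q, k, v, glyph_token_mask, region_mask):
    """q: (B, HW, d) spatial queries; k, v: (B, T, d) prompt tokens;
    glyph_token_mask: (T,) bool, True for glyph/text tokens;
    region_mask: (B, HW) bool, True inside the target text region."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, HW, T)
    # Penalize attention to glyph tokens from pixels outside the region.
    bias = (~region_mask).float().unsqueeze(-1) \
         * glyph_token_mask.float().view(1, 1, -1) * -1e4
    attn = (scores + bias).softmax(dim=-1)
    return attn @ v                                           # (B, HW, d)
```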
Nowadays, scene text recognition has attracted increasing attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains inference speed. Conversely, non-autoregressive models provide faster, simultaneous predictions but often sacrifice accuracy. Although utilizing an explicit language model can improve performance, it increases the computational load. Besides, separating linguistic knowledge from visual information may harm the final prediction. In this paper, we propose an alternative solution, using a parallel and iterative decoder that adopts an easy-first decoding strategy. Furthermore, we regard text recognition as an image-based conditional text generation task and utilize a discrete diffusion strategy, ensuring exhaustive exploration of bidirectional contextual information. Extensive experiments demonstrate that the proposed approach achieves superior results on benchmark datasets, including both Chinese and English text images.
https://arxiv.org/abs/2312.11923
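A sketch of easy-first parallel decoding in the spirit described above: start from a fully masked sequence and, at each iteration, commit only the most confident ("easiest") predictions, re-predicting the rest with bidirectional context. The `model(img_feats, tokens)` interface and the fixed commit schedule are assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def easy_first_decode(model, img_feats, seq_len, mask_id, steps=4):
    """Iteratively unmask the easiest positions first."""
    B = img_feats.size(0)
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long,
                        device=img_feats.device)
    k = -(-seq_len // steps)                            # commits per step (ceil)
    for _ in range(steps):
        logits = model(img_feats, tokens)               # (B, L, V), assumed API
        conf, pred = logits.softmax(-1).max(-1)         # (B, L) each
        still_masked = tokens == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)    # rank only masked slots
        pred = torch.where(still_masked, pred, tokens)  # never change commits
        idx = conf.topk(k, dim=-1).indices              # easiest positions
        tokens.scatter_(1, idx, pred.gather(1, idx))    # commit them
    return tokens
```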
Natural scene text detection is a significant challenge in computer vision, with tremendous potential applications in multilingual, diverse, and complex text scenarios. We propose a multilingual text detection model to address the low accuracy and high difficulty of detecting multilingual text in natural scenes. In response to the challenges posed by multilingual text images with multiple character sets and various font styles, we introduce the SFM Swin Transformer feature extraction network to enhance the model's robustness in detecting characters and fonts across different languages. To deal with the considerable variation in text scales and the complex arrangements in natural scene text images, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module. These feature fusion improvements enhance the model's ability to detect text of varying sizes and orientations. Handling the diverse backgrounds and font variations in multilingual scene text images is a challenge for existing methods, whose limited local receptive fields hinder detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features for more effective text detection, meeting the need for comprehensive information. In this study, we collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses. The experimental results demonstrate that the proposed algorithm achieves an F-measure of 85.02\%, which is 4.71\% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset can be found at this https URL.
https://arxiv.org/abs/2312.11153
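The Adaptive Spatial Feature Fusion idea can be sketched as per-pixel softmax weighting over pyramid levels (levels assumed pre-resized to a common resolution); this mirrors the generic published ASFF formulation rather than the paper's exact AS-HRFPN.

```python
import torch
import torch.nn as nn

class AdaptiveSpatialFusion(nn.Module):
    """Sketch: per-pixel softmax weights decide each pyramid level's share."""
    def __init__(self, channels, num_levels):
        super().__init__()
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)])

    def forward(self, level_feats):            # list of (B, C, H, W)
        logits = torch.cat([conv(f) for conv, f in
                            zip(self.weight_convs, level_feats)], dim=1)
        w = logits.softmax(dim=1)              # (B, num_levels, H, W)
        return sum(w[:, i:i + 1] * f for i, f in enumerate(level_feats))
```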
In this paper, we investigate cross-lingual learning (CLL) for multilingual scene text recognition (STR). CLL transfers knowledge from one language to another. We aim to find the conditions under which knowledge from high-resource languages improves performance in low-resource languages. To do so, we first examine whether two general insights about CLL discussed in previous works apply to multilingual STR: (1) joint learning with high- and low-resource languages may reduce performance on low-resource languages, and (2) CLL works best between typologically similar languages. Through extensive experiments, we show that these two general insights do not apply to multilingual STR. We then show that the crucial condition for CLL is the dataset size of the high-resource languages, regardless of which high-resource languages are used. Our code, data, and models are available at this https URL.
https://arxiv.org/abs/2312.10806
Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. In contrast, in this work, we propose \textbf{FreeReal}, a real-domain-aligned pre-training paradigm that harnesses the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge the real and synthetic worlds for pre-training, a novel glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 4.56\%, 3.85\%, 3.90\%, and 1.97\% in improving the performance of the DBNet, PANet, PSENet, and FCENet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code will be released soon.
https://arxiv.org/abs/2312.05286
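The GlyphMix mechanism as described — character structures cut from synthetic images and pasted as graffiti-like units onto real images — reduces, at its simplest, to a masked composite. A minimal NumPy sketch follows; the real method's character-structure delineation is more involved than a precomputed mask.

```python
import numpy as np

def glyph_mix(real_img, synth_img, synth_char_mask):
    """real_img, synth_img: (H, W, 3) uint8 arrays of the same size;
    synth_char_mask: (H, W) bool, True on synthetic character pixels.
    Returns a real-domain image carrying the synthetic glyphs, so the
    synthetic boxes/transcripts remain valid annotations for it."""
    mixed = real_img.copy()
    mixed[synth_char_mask] = synth_img[synth_char_mask]
    return mixed
```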
Text-to-Image (T2I) generation methods based on diffusion models have garnered significant attention in the last few years. Although these image synthesis methods produce visually appealing results, they frequently exhibit spelling errors when rendering text within the generated images. Such errors manifest as missing, incorrect, or extraneous characters, severely constraining the performance of text image generation based on diffusion models. To address the aforementioned issue, this paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion [27]). Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder and provides more robust text embeddings as conditional guidance. Then, we fine-tune the diffusion model on a large-scale dataset, incorporating local attention control under the supervision of character-level segmentation maps. Finally, by employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrarily given images. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art. Furthermore, we showcase several potential applications of the proposed UDiffText, including text-centric image synthesis, scene text editing, etc. Code and model will be available at this https URL.
https://arxiv.org/abs/2312.04884
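A minimal sketch of what a light-weight character-level text encoder could look like in place of the CLIP encoder: one embedding per character plus a small Transformer, producing per-character conditioning vectors. The dimensions, depth, and charset handling are assumptions, not UDiffText's actual encoder.

```python
import torch
import torch.nn as nn

class CharLevelTextEncoder(nn.Module):
    """Sketch: per-character embeddings for conditioning a diffusion model."""
    def __init__(self, charset_size, max_len=24, dim=256, depth=4, heads=8):
        super().__init__()
        self.char_emb = nn.Embedding(charset_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, char_ids):                 # (B, L) integer char codes
        x = self.char_emb(char_ids) + self.pos_emb[:, :char_ids.size(1)]
        return self.encoder(x)                   # (B, L, dim) embeddings
```

Because every character gets its own embedding, the conditioning signal can localize spelling in a way a sentence-level CLIP embedding cannot, which matches the motivation the abstract gives.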
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, semantic information and visual structure, affect recognition performance significantly. To mitigate the effects of these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, a diffusion-based module is developed to enhance the text prior, offering better guidance for the SR network to generate SR images with higher semantic accuracy. Meanwhile, the proposed PEAN leverages an attention-based modulation module to understand scene text images by perceiving the local and global dependence of images, regardless of the shape of the text. A multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new SOTA results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior in improving the performance of the SR network. Code will be made available at this https URL.
https://arxiv.org/abs/2311.17955
Robustness certification, which aims to formally certify the predictions of neural networks against adversarial inputs, has become an important tool for safety-critical applications. Despite considerable progress, existing certification methods are limited to elementary architectures, such as convolutional networks, recurrent networks, and recently Transformers, on benchmark datasets such as MNIST. In this paper, we focus on the robustness certification of scene text recognition (STR), which is a complex and extensively deployed image-based sequence prediction problem. We tackle three types of STR model architectures, including the standard STR pipelines and the Vision Transformer. We propose STR-Cert, the first certification method for STR models, by significantly extending the DeepPoly polyhedral verification framework with novel polyhedral bounds and algorithms for key STR model components. Finally, we certify and compare STR models on six datasets, demonstrating the efficiency and scalability of robustness certification, particularly for the Vision Transformer.
https://arxiv.org/abs/2401.05338
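For intuition about what bound propagation does, here is the simplest relaxation — plain interval arithmetic — pushed through an affine layer and a ReLU. DeepPoly, which STR-Cert extends, keeps tighter symbolic linear bounds per neuron, but the positive/negative weight split below is the same basic ingredient.

```python
import numpy as np

def affine_bounds(W, b, lb, ub):
    """Propagate element-wise input bounds [lb, ub] through x -> W @ x + b.
    Positive weights pair with like bounds, negative weights with the
    opposite bounds."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    new_lb = W_pos @ lb + W_neg @ ub + b
    new_ub = W_pos @ ub + W_neg @ lb + b
    return new_lb, new_ub

def relu_bounds(lb, ub):
    """ReLU is monotone, so interval bounds pass through directly."""
    return np.maximum(lb, 0), np.maximum(ub, 0)

# Certification sketch: if the lower bound of (logit_true - logit_j) stays
# above 0 for every wrong class j over the whole perturbation region, the
# prediction is certifiably robust on that region.
```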