Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module to incorporate the canonical mask guidance for further feature refinement for text recognition. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM exhibits superiority over the state-of-the-art method by an average performance gain of 4.1% across six more challenging datasets, despite utilizing a smaller model size. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement techniques for robust scene text recognition. The code is available at this https URL.
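The core idea above — using a canonical glyph mask to suppress background and style noise before fusion — can be illustrated with a minimal numpy sketch. This is not CAM's actual module: the gating-plus-residual form, the function name, and the `alpha` blending weight are all illustrative assumptions.

```python
import numpy as np

def mask_guided_refine(features: np.ndarray, glyph_mask: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Toy mask-guided refinement: blend raw features with mask-gated
    features so off-glyph (background) activations are attenuated.

    features:   (C, H, W) feature map
    glyph_mask: (H, W) canonical glyph mask in [0, 1]
    alpha:      hypothetical blending weight for the masked branch
    """
    gated = features * glyph_mask[None, :, :]        # suppress background responses
    return (1.0 - alpha) * features + alpha * gated  # residual-style fusion

# 1-channel 2x2 example: the background pixel (mask = 0) is attenuated.
feats = np.array([[[1.0, 2.0], [3.0, 4.0]]])
mask = np.array([[1.0, 0.0], [1.0, 1.0]])
refined = mask_guided_refine(feats, mask, alpha=0.5)
```

In the real model the mask is predicted per character class from a standard font and the fusion happens after explicit feature alignment; the sketch only shows why masking sharpens feature discrimination.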
https://arxiv.org/abs/2402.13643
Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images based on personalized visual concepts from a limited number of user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when manipulating scenes as defined by textual inputs. To address this, we introduce ComFusion, a novel approach that leverages pretrained models to generate compositions of a few user-provided subject images and predefined text scenes, effectively fusing visual subject instances with text-specified scenes and resulting in the generation of high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior preservation regularization, which composites subject-class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images, ensuring they align effectively with both the instance images and scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.
https://arxiv.org/abs/2402.11849
Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well-suited for irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel at complex layouts but struggle with irregular shapes. To strengthen their merits and overcome their respective demerits, we propose a Complementary Proposal Network (CPN) that seamlessly integrates semantic and geometric information in parallel for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals employing an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals with pre-defined anchors. To further enhance their complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches by significant margins at comparable computation cost. Specifically, our approach achieves improvements of 3.6%, 1.3% and 1.0% on the challenging benchmarks ICDAR19-ArT, IC15, and MSRA-TD500, respectively. Code for our method will be released.
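As a rough intuition for "complementary proposals", the sketch below pools boxes from two sources and removes near-duplicates with plain IoU-based NMS. This is a stand-in, not CPN's actual fusion: the paper interleaves features before proposal generation, whereas this toy merely merges final box lists.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def merge_proposals(semantic, geometric, iou_thresh=0.5):
    """Pool both proposal sources, then greedily keep the highest-scoring
    box and drop near-duplicates (standard NMS)."""
    pool = sorted(semantic + geometric, key=lambda p: -p[1])  # (box, score) pairs
    kept = []
    for box, score in pool:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept

# One semantic proposal duplicates a geometric one; the duplicate is dropped.
sem = [((0, 0, 10, 4), 0.9)]
geo = [((0, 0, 10, 4), 0.8), ((20, 0, 30, 4), 0.7)]
merged = merge_proposals(sem, geo)
```

The value of complementarity is that each source contributes boxes the other misses (irregular shapes vs. compact layouts); the toy just shows the mechanical merge step.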
https://arxiv.org/abs/2402.11540
We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges, and discuss the system architecture, design choices, and modeling techniques employed to overcome these obstacles. We also provide a comprehensive evaluation for each component, showcasing high quality and efficiency.
https://arxiv.org/abs/2402.08017
Multi-modal models have shown appealing performance in visual tasks recently, as instruction-guided training has evoked the ability to understand fine-grained visual content. However, current methods cannot be trivially applied to scene text recognition (STR) due to the gap between natural and text images. In this paper, we introduce a novel paradigm that formulates STR as an instruction learning problem, and propose instruction-guided scene text recognition (IGTR) to achieve effective cross-modal learning. IGTR first generates rich and diverse instruction triplets of <condition, question, answer>, serving as guidance for nuanced text image understanding. Then, we devise an architecture with a dedicated cross-modal feature fusion module and a multi-task answer head to effectively fuse the required instruction and image features for answering questions. Built upon these designs, IGTR facilitates accurate text recognition by comprehending character attributes. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins. Furthermore, by adjusting the instructions, IGTR enables various recognition schemes. These include zero-shot prediction, where the model is trained on instructions not explicitly targeting character recognition, and the recognition of rarely appearing and morphologically similar characters, which were previously challenging for existing models.
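To make the <condition, question, answer> idea concrete, here is a toy generator that derives instruction triplets from a ground-truth word. The templates are invented for illustration; IGTR's actual triplet set is far richer and covers more character attributes.

```python
def make_triplets(word: str):
    """Generate toy <condition, question, answer> instruction triplets from a
    ground-truth label, in the spirit of instruction-guided recognition.
    The wording of conditions/questions is hypothetical."""
    triplets = [("none", "How many characters are in the image?", str(len(word)))]
    for i, ch in enumerate(word):
        triplets.append((f"position {i}", "Which character is here?", ch))
        triplets.append((f"character '{ch}'",
                         "At which position does it first appear?",
                         str(word.index(ch))))
    return triplets

trips = make_triplets("cat")
```

Training on many such attribute-level questions, rather than only on full transcriptions, is what lets this family of models answer instructions it was never explicitly trained to transcribe with.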
https://arxiv.org/abs/2401.17851
Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of complete text styles, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulty restoring accurate text images with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images drawn from real-life and synthetic sources, featuring pairs of original images, corrupted images, and other auxiliary information. On top of the datasets, we further develop a novel neural framework, the Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by a thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: this https URL.
https://arxiv.org/abs/2401.14832
Recently, scene text detection has received significant attention due to its wide applications. However, accurate detection in complex scenes with multiple scales, orientations, and curvatures remains a challenge. Numerous detection methods adopt the Vatti clipping (VC) algorithm for multiple-instance training to address the issue of arbitrary-shaped text. Yet we identify a bias in these approaches, which we call the "shrunk kernel": a decrease in accuracy resulting from an output that overly favors the text kernel. In this paper, we propose a new approach named Expand Kernel Network (EK-Net), which uses an expand-kernel distance to compensate for this deficiency and performs three-stage regression to complete instance detection. Moreover, EK-Net not only realizes precise positioning of arbitrary-shaped text, but also achieves a trade-off between performance and speed. Evaluation results demonstrate that EK-Net achieves state-of-the-art or competitive performance compared to other advanced methods, e.g., an F-measure of 85.72% at 35.42 FPS on ICDAR 2015 and an F-measure of 85.75% at 40.13 FPS on CTW1500.
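For context on the "shrunk kernel" bias: kernel-based detectors in this family typically shrink each text polygon by an offset derived from its area and perimeter before applying Vatti clipping, commonly d = A(1 - r^2)/L (as popularized by PSENet/DB-style methods). The sketch below computes that offset with the shoelace formula; it shows the shrinking step the abstract argues against over-relying on, not EK-Net's expand-kernel distance itself.

```python
def shrink_offset(poly, r=0.4):
    """Shrink distance d = A * (1 - r^2) / L used with the Vatti clipping
    algorithm in kernel-based text detectors; poly is a list of (x, y)
    vertices, r is the shrink ratio."""
    n = len(poly)
    area = 0.0    # signed area accumulator (shoelace formula)
    perim = 0.0   # perimeter accumulator
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        area += x1 * y2 - x2 * y1
        perim += ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    area = abs(area) / 2.0
    return area * (1 - r ** 2) / perim

# 100x20 rectangle: A = 2000, L = 240, r = 0.4 -> d = 2000 * 0.84 / 240 = 7.0
d = shrink_offset([(0, 0), (100, 0), (100, 20), (0, 20)])
```

Because the offset scales with area over perimeter, thin elongated instances shrink to very narrow kernels, which is exactly where an output biased toward the kernel loses boundary accuracy.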
https://arxiv.org/abs/2401.11704
Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes. Although current state-of-the-art models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures comprised of visual encoders and sequence decoders. In this work, we propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by multiple self-attention layers, while eschewing the traditional sequence decoder. This design choice results in a lightweight and efficient model capable of handling inputs of varying sizes. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of VIPTR. Notably, the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which blends high accuracy with efficiency and greatly benefits real-world applications requiring fast and reliable text recognition. The code is publicly available at this https URL.
https://arxiv.org/abs/2401.10110
Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored in the process of semantic mining, which limits the performance of the algorithm in recognizing irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch carries out visual recognition based on the visual features extracted by CNN and the position encoding information provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way that human recognizes scene text and integrates cross-modal visual cues for text recognition. The experiments demonstrate that the proposed CMFN algorithm achieves comparable performance to state-of-the-art algorithms, indicating its effectiveness.
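The abstract does not spell out the form of the position self-enhanced encoder, so as a hedged stand-in, here is the standard sinusoidal position encoding that such character-sequence position encodings typically build on.

```python
import math

def sinusoidal_positions(seq_len: int, dim: int):
    """Standard sinusoidal position encoding (Transformer-style): a stand-in
    for a character-sequence position encoder, not CMFN's exact module.
    Returns a seq_len x dim list of lists."""
    enc = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # even indices use sin, odd indices use cos, at the same frequency
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(row)
    return enc

pe = sinusoidal_positions(4, 8)
```

Both the visual branch and the iterative semantic branch consume the same positional signal, which is what keeps character order consistent across the two recognition paths.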
https://arxiv.org/abs/2401.10041
Segmentation-based scene text detection algorithms can handle arbitrary-shaped scene texts and have strong robustness and adaptability, so they have attracted wide attention. Existing segmentation-based scene text detection algorithms usually only segment the pixels in the center region of the text, while ignoring other information of the text region, such as edge information, distance information, etc., thus limiting the detection accuracy of the algorithm for scene text. This paper proposes a plug-and-play module called the Region Multiple Information Perception Module (RMIPM) to enhance the detection performance of segmentation-based algorithms. Specifically, we design an improved module that can perceive various types of information about scene text regions, such as text foreground classification maps, distance maps, direction maps, etc. Experiments on the MSRA-TD500 and TotalText datasets show that our method achieves comparable performance with current state-of-the-art algorithms.
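One of the auxiliary targets mentioned above, the distance map, assigns each foreground pixel its distance to the nearest background pixel. The brute-force sketch below computes it for a tiny mask; real pipelines would use a linear-time distance transform (e.g., `scipy.ndimage.distance_transform_edt`), this O(N^2) loop is for illustration only.

```python
import numpy as np

def distance_map(mask: np.ndarray) -> np.ndarray:
    """Brute-force Euclidean distance map: for each foreground pixel,
    the distance to the nearest background pixel. Illustrative only."""
    h, w = mask.shape
    bg = [(y, x) for y in range(h) for x in range(w) if mask[y, x] == 0]
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            if mask[y, x]:
                out[y, x] = min(((y - by) ** 2 + (x - bx) ** 2) ** 0.5
                                for by, bx in bg)
    return out

# 3x3 text region centered in a 5x5 image.
m = np.zeros((5, 5), dtype=int)
m[1:4, 1:4] = 1
dm = distance_map(m)
```

Supervising such a map gives the network a graded notion of "how deep inside the text" a pixel is, instead of only a binary center-region label.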
https://arxiv.org/abs/2401.10017
Arbitrary-shape scene text detection is of great importance in scene understanding tasks. Due to the complexity and diversity of text in natural scenes, existing scene text algorithms have limited accuracy for detecting arbitrary-shape text. In this paper, we propose a novel arbitrary-shape scene text detector based on boundary point dynamic optimization (BPDO). The proposed model consists of a text-aware module (TAM) and a boundary point dynamic optimization module (DOM). Specifically, the text-aware module is based on segmentation and obtains boundary points describing the central region of the text by extracting a priori information about the text region. Then, based on the idea of deformable attention, the model dynamically optimizes the boundary points, gradually refining the exact position of each boundary point using information from its adjacent region. Experiments on the CTW-1500, Total-Text, and MSRA-TD500 datasets show that the proposed model achieves performance better than or comparable to state-of-the-art algorithms, proving its effectiveness.
https://arxiv.org/abs/2401.09997
End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition. Specifically, we enhance the relationship between the two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through recognition loss, while Recognition Alignment dynamically extracts text features for recognition through the detection predictions. This simple yet effective design results in a concise framework that requires neither an additional rectification module nor character-level annotations for arbitrarily-shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at \href{this https URL}{SwinTextSpotter v2}.
https://arxiv.org/abs/2401.07641
Scene text spotting is a challenging task, especially for inverse-like scene text, which has complex layouts, e.g., mirrored, symmetrical, or retro-flexed. In this paper, we propose a unified end-to-end trainable inverse-like antagonistic text spotting framework dubbed IATS, which can effectively spot inverse-like scene texts without sacrificing general ones. Specifically, we propose an innovative reading-order estimation module (REM) that extracts reading-order information from the initial text boundary generated by an initial boundary module (IBM). To optimize and train REM, we propose a joint reading-order estimation loss consisting of a classification loss, an orthogonality loss, and a distribution loss. With the help of IBM, we can divide the initial text boundary into two symmetric control points and iteratively refine the new text boundary using a lightweight boundary refinement module (BRM) for adapting to various shapes and scales. To alleviate the incompatibility between text detection and recognition, we propose a dynamic sampling module (DSM) with a thin-plate spline that can dynamically sample appropriate features for recognition in the detected text region. Without extra supervision, the DSM can proactively learn to sample appropriate features for text recognition through the gradient returned by the recognition module. Extensive experiments on both challenging scene text and inverse-like scene text datasets demonstrate that our method achieves superior performance both on irregular and inverse-like text spotting.
https://arxiv.org/abs/2401.03637
The scaling laws relating model size, data volume, and computation to model performance have been extensively studied in the field of Natural Language Processing (NLP). However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume, and computation in the field of text recognition. Conclusively, the study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%.
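The "smooth power laws" above are conventionally estimated by fitting error ≈ a · N^b as a straight line in log-log space. The sketch below recovers the exponent from synthetic data generated to follow an exact power law; the particular constants are illustrative, not the paper's fitted values.

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit error ~= a * size^b by linear regression in log-log space,
    the usual way scaling-law exponents are estimated."""
    b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
    return np.exp(log_a), b

# Synthetic data following error = 2 * N^-0.5 exactly.
sizes = np.array([1e4, 1e5, 1e6, 1e7])
errors = 2.0 * sizes ** -0.5
a, b = fit_power_law(sizes, errors)
```

With a fitted (a, b) in hand, one can extrapolate the model/data scale needed to hit a target error, which is the practical use of such laws.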
https://arxiv.org/abs/2401.00028
Scene text spotting is essential in various computer vision applications, enabling extracting and interpreting textual information from images. However, existing methods often neglect the spatial semantics of word images, leading to suboptimal detection recall rates for long and short words within the long-tailed word length distributions that exist prominently in dense scenes. In this paper, we present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition, improving the spotting capabilities for long and short words, particularly in the tail data of dense text images. We first design an image encoder equipped with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, leveraging the Transformer framework, we synergistically optimize text detection and recognition accuracy after iteratively refining text region image features using the word length prior. Specifically, we design a Spatial Length Predictor module (SLP) using a character-count prior tailored to different word lengths to constrain the regions of interest effectively. Furthermore, we introduce a specialized word Length-aware Segmentation (LenSeg) proposal head, enhancing the network's capacity to capture the distinctive features of long and short words within categories characterized by long-tailed distributions. Comprehensive experiments on public datasets and our dense text spotting dataset DSTD1500 demonstrate the superiority of our proposed methods, particularly in dense text image detection and recognition tasks involving long-tailed word length distributions encompassing a range of long and short words.
https://arxiv.org/abs/2312.15690
The advancement of text shape representations towards compactness has enhanced text detection and spotting performance, but at a high annotation cost. Current models use single-point annotations to reduce costs, yet they lack sufficient localization information for downstream applications. To overcome this limitation, we introduce Point2Polygon, which can efficiently transform single-points into compact polygons. Our method uses a coarse-to-fine process, starting with creating and selecting anchor points based on recognition confidence, then vertically and horizontally refining the polygon using recognition information to optimize its shape. We demonstrate the accuracy of the generated polygons through extensive experiments: 1) By creating polygons from ground truth points, we achieved an accuracy of 82.0% on ICDAR 2015; 2) In training detectors with polygons generated by our method, we attained 86% of the accuracy relative to training with ground truth (GT); 3) Additionally, the proposed Point2Polygon can be seamlessly integrated to empower single-point spotters to generate polygons. This integration led to an impressive 82.5% accuracy for the generated polygons. It is worth mentioning that our method relies solely on synthetic recognition information, eliminating the need for any manual annotation beyond single points.
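A toy version of the coarse-to-fine idea: grow a region around a single point and keep the extent that maximizes a recognition-confidence score. Everything here is hypothetical — `score_fn` stands in for a recognizer, the box is axis-aligned rather than a polygon, and the growth schedule is invented — but it shows how recognition information alone can turn a point into a localized region.

```python
def point_to_box(point, score_fn, max_half_width=50, step=5):
    """Coarse-to-fine sketch: grow a horizontal box around a single point
    annotation and keep the extent that maximizes a recognition-confidence
    score. `score_fn` is a hypothetical stand-in for the recognizer."""
    x, y = point
    best, best_score = None, -1.0
    for hw in range(step, max_half_width + 1, step):
        box = (x - hw, y - 5, x + hw, y + 5)  # fixed toy height of 10 px
        s = score_fn(box)
        if s > best_score:
            best, best_score = box, s
    return best

# Toy confidence that peaks when the box half-width reaches 20 px.
peak = lambda box: -abs((box[2] - box[0]) / 2 - 20)
box = point_to_box((100, 30), peak)
```

Point2Polygon itself refines vertically and horizontally into a compact polygon; the sketch captures only the select-by-confidence principle that removes the need for polygon annotations.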
https://arxiv.org/abs/2312.13778
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich, informative answers in each round, even drawing on external knowledge related to the visual content. Different from existing datasets where the answers are compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3). GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. Then, we ask human annotators to rate the generated dialogues to filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene texts, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique, with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Last, we propose a strong baseline by adapting the GIT model for the visual dialogue task and fine-tuning it on InfoVisDial. Hopefully, our work can motivate more effort in this direction.
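The bridging step can be sketched as simple prompt assembly: the captioner's output (plus any OCR'd scene text) is packed into a prompt for a text-only LLM to write the dialogue. The wording below is illustrative, not the paper's actual prompt template.

```python
def build_dialogue_prompt(caption: str, scene_texts, n_rounds: int = 3) -> str:
    """Sketch of bridging a captioning model and a text-only LLM: pack the
    image description and detected scene text into one generation prompt.
    Template wording is hypothetical."""
    lines = [f"Image description: {caption}"]
    if scene_texts:
        lines.append("Scene text in the image: " + ", ".join(scene_texts))
    lines.append(f"Write a {n_rounds}-round informative Q&A dialogue "
                 "about this image, with long free-form answers.")
    return "\n".join(lines)

prompt = build_dialogue_prompt("a storefront at night", ["OPEN 24 HOURS"])
```

Feeding scene text explicitly is what makes over half the generated rounds text-related, since the LLM otherwise has no access to the pixels.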
https://arxiv.org/abs/2312.13503
Recently, diffusion-based image generation methods have been credited for their remarkable text-to-image generation capabilities, yet they still face challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given text in any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus unlocking the potential multilingual generation ability of the pre-trained Stable Diffusion. Based on observations of how the cross-attention map influences object placement in generated images, we propose a localized attention constraint in the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
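A minimal sketch of a localized attention constraint: push attention logits for positions outside a target region to -inf before the softmax, so all attention mass lands inside the region. This is a simplification of the idea, not Diff-Text's exact formulation.

```python
import numpy as np

def localized_attention(scores: np.ndarray, region_mask: np.ndarray) -> np.ndarray:
    """Constrain attention to a target region: out-of-region logits are set
    to -inf so they receive exactly zero weight after the softmax."""
    masked = np.where(region_mask, scores, -np.inf)
    e = np.exp(masked - masked.max())  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 3.0, 0.5])          # raw cross-attention logits
region = np.array([True, True, False, False])    # text may only land here
attn = localized_attention(scores, region)
```

Note that position 2 had the highest raw logit but gets zero weight, which is exactly how the constraint overrides the model's default (and possibly unreasonable) placement.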
https://arxiv.org/abs/2312.12232
Nowadays, scene text recognition has attracted more and more attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains inference speed. Conversely, non-autoregressive models provide faster, simultaneous predictions but often sacrifice accuracy. Although utilizing an explicit language model can improve performance, it increases the computational load. Besides, separating linguistic knowledge from vision information may harm the final prediction. In this paper, we propose an alternative solution, using a parallel and iterative decoder that adopts an easy-first decoding strategy. Furthermore, we regard text recognition as an image-based conditional text generation task and utilize the discrete diffusion strategy, ensuring exhaustive exploration of bidirectional contextual information. Extensive experiments demonstrate that the proposed approach achieves superior results on the benchmark datasets, including both Chinese and English text images.
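The easy-first strategy can be sketched as: at each iteration, commit the most confident undecided position. The toy below uses fixed per-position probabilities; a real model would re-score the remaining positions conditioned on what is already committed, which is where the bidirectional context enters.

```python
import numpy as np

def easy_first_decode(probs: np.ndarray):
    """Easy-first decoding sketch over a (length, vocab) probability matrix:
    repeatedly commit the undecided position with the highest confidence.
    Probabilities are frozen here for illustration."""
    length = probs.shape[0]
    out = [None] * length
    order = []
    while any(o is None for o in out):
        undecided = [i for i in range(length) if out[i] is None]
        i = max(undecided, key=lambda j: probs[j].max())  # the "easiest" slot
        out[i] = int(probs[i].argmax())
        order.append(i)
    return out, order

# 3 positions over a 2-symbol vocabulary; position 1 is the most confident.
p = np.array([[0.6, 0.4], [0.1, 0.9], [0.7, 0.3]])
tokens, order = easy_first_decode(p)
```

Unlike left-to-right decoding, the commit order follows confidence, so hard (occluded or ambiguous) characters are resolved last, with the most context available.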
https://arxiv.org/abs/2312.11923
Natural scene text detection is a significant challenge in computer vision, with tremendous potential applications in multilingual, diverse, and complex text scenarios. We propose a multilingual text detection model to address the low accuracy and high difficulty of detecting multilingual text in natural scenes. In response to the challenges posed by multilingual text images with multiple character sets and various font styles, we introduce the SFM Swin Transformer feature extraction network to enhance the model's robustness in detecting characters and fonts across different languages. To deal with the considerable variation in text scales and complex arrangements in natural scene text images, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module. These feature fusion improvements enhance the model's ability to detect text of varying sizes and orientations. Addressing diverse backgrounds and font variations in multilingual scene text images is a challenge for existing methods, whose limited local receptive fields hinder detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features for more effective text detection, meeting the need for comprehensive information. In this study, we collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses. The experimental results demonstrate that the proposed algorithm achieves an F-measure of 85.02\%, which is 4.71\% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset can be found at this https URL.
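Adaptive spatial feature fusion, in its generic form, learns per-pixel softmax weights over already-resized feature maps from different scales. The numpy sketch below hard-codes the weight logits that a small conv layer would normally predict; it illustrates the mechanism, not this paper's exact AS-HRFPN design.

```python
import numpy as np

def adaptive_fuse(feature_maps: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Per-pixel adaptive fusion of S same-sized scale features.

    feature_maps: (S, C, H, W) resized features from S scales
    logits:       (S, H, W) fusion logits (predicted by a conv in practice)
    """
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # (S, H, W) softmax weights
    return (w[:, None] * feature_maps).sum(axis=0)  # (C, H, W) fused map

# Two scales, one channel, a 1x2 map; the second pixel trusts scale 1 heavily.
fm = np.array([[[[1.0, 1.0]]], [[[3.0, 3.0]]]])    # (S=2, C=1, H=1, W=2)
lg = np.array([[[0.0, -100.0]], [[0.0, 100.0]]])   # (S=2, H=1, W=2)
fused = adaptive_fuse(fm, lg)
```

Letting each pixel choose its own scale mix is what helps a single detector cover both tiny and very large text instances in the same image.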
https://arxiv.org/abs/2312.11153