Existing scene text spotters are designed to locate and transcribe text in images. However, it is challenging for a spotter to achieve precise detection and recognition of scene text simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and the impressive performance of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot text without precise detection, just like human beings?" and, if yes, 2) "Is the text block another alternative granularity for scene text spotting besides the word and the character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we fine-tune a PLM on a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during pre-training, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incompletely detected text. Thanks to the fine-tuned language model and the text-block detection paradigm, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot text directly from an entire scene image to demonstrate the potential of PLMs, and even Large Language Models (LLMs).
https://arxiv.org/abs/2403.10047
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text-paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.
https://arxiv.org/abs/2403.09622
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods rely heavily on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial-location information and erroneous OCR text often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial-awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance fault-tolerant representations of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Extensive experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
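The central mechanism of the AOE module, adversarial perturbation in an embedding space, can be illustrated with a minimal FGSM-style sketch. Everything below is a toy stand-in: the 4-dimensional "OCR embedding", the squared-error loss, and the step size are invented for illustration and are not the paper's actual setup.

```python
def toy_loss_grad(embedding, target):
    """Gradient of the toy squared-error loss L(e) = sum((e - t)^2)."""
    return [2 * (e - t) for e, t in zip(embedding, target)]

def fgsm_perturb(embedding, grad, eps=0.05):
    """FGSM-style step: move eps in the sign direction of the gradient,
    producing a worst-case perturbed embedding for adversarial training."""
    sign = lambda g: 1 if g > 0 else -1 if g < 0 else 0
    return [e + eps * sign(g) for e, g in zip(embedding, grad)]

# Hypothetical embedding of a (possibly misrecognized) OCR token.
emb = [0.5, -0.2, 0.8, 0.1]
target = [0.4, 0.0, 0.7, 0.1]   # clean reference embedding

grad = toy_loss_grad(emb, target)
adv = fgsm_perturb(emb, grad, eps=0.05)
print(adv)
```

Training on such perturbed embeddings alongside the clean ones is what makes the learned OCR representation tolerant to small input errors.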
https://arxiv.org/abs/2403.09288
The importance of Scene Text Recognition (STR) in today's increasingly digital world cannot be overstated. Given this significance, data-intensive deep learning approaches that auto-learn feature mappings have primarily driven the development of STR solutions. Several benchmark datasets and substantial work on deep learning models are available for Latin languages to meet this need. For the syntactically and semantically more complex Indian languages, spoken and read by 1.3 billion people, far less work and fewer datasets are available. This paper aims to address the lack of a comprehensive dataset in the Indian space by proposing the largest and most comprehensive real dataset, IndicSTR12, and benchmarking STR performance on 12 major Indian languages. A few works have addressed the same issue, but to the best of our knowledge, they focused on a small number of Indian languages. The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries, while its multilingualism will catalyse the development of robust text detection and recognition models. It was created specifically for a group of related languages with different scripts. The dataset contains over 27,000 word images gathered from various natural scenes, with over 1,000 word images for each language. Unlike previous datasets, the images cover a broader range of realistic conditions, including blur, illumination changes, occlusion, non-iconic text, low resolution, and perspective text. Along with the new dataset, we provide a high-performing baseline on three models: PARSeq, CRNN, and STARNet.
https://arxiv.org/abs/2403.08007
Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while real-world applications contain various out-of-vocabulary (OOV) words. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To solve this problem, we first propose a pseudo-label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds to simulate real-world applications. Secondly, to reduce noise in the pseudo data, we present a semantic checking mechanism to filter semantically meaningful data. Thirdly, we introduce a quality-aware margin loss to boost training with pseudo data. Our loss includes a margin-based part to enhance classification ability and a quality-aware part to penalize low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state of the art on eight datasets and ranks first in the ICDAR 2022 challenge.
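A quality-aware margin loss of the general kind described above can be sketched as follows. The margin value, the scalar quality weight, and the way quality scales the loss are assumptions for illustration, not the paper's exact formulation.

```python
import math

def quality_margin_loss(logits, target, margin=0.35, quality=1.0):
    """Additive-margin cross-entropy, down-weighted for low-quality samples.
    `margin` is subtracted from the target logit before the softmax, which
    tightens the decision boundary; `quality` in [0, 1] scales the loss so
    noisy pseudo-labelled samples contribute less (hypothetical weighting)."""
    adjusted = list(logits)
    adjusted[target] -= margin
    m = max(adjusted)  # log-sum-exp with max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return quality * (log_z - adjusted[target])

# Same prediction, but the pseudo sample's low quality shrinks its gradient.
clean = quality_margin_loss([3.0, 1.0, 0.5], target=0, quality=1.0)
noisy = quality_margin_loss([3.0, 1.0, 0.5], target=0, quality=0.3)
print(clean, noisy)
```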
https://arxiv.org/abs/2403.07518
Text detection is frequently used in vision-based mobile robots when they need to interpret text in their surroundings to perform a given task. For instance, delivery robots in multilingual cities need to be capable of multilingual text detection so that they can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train models to recognize novel languages. However, collecting and labeling training data for novel languages is cumbersome, and the effort to re-train an existing text detector is considerable. Even worse, this routine repeats whenever a novel language appears. This motivates us to propose a new problem setting that tackles the aforementioned challenges more efficiently: "We ask for a generalizable multilingual text detection framework that detects and identifies both seen and unseen language regions inside scene images, without requiring supervised training data for unseen languages or model re-training." To this end, we propose MENTOR, the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
https://arxiv.org/abs/2403.07286
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens; by using similarity to filter them out and retain the significant ones, we not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability and minimize hallucinations. Additionally, TextMonkey can be fine-tuned to comprehend commands for clicking on screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving increases of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and a score of 561 on OCRBench, surpassing prior open-sourced large multimodal models for document understanding. Code will be released at this https URL.
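The similarity-based token filtering idea can be sketched with a greedy cosine-similarity filter. The threshold, the 2-D toy "tokens", and the greedy keep/drop rule are illustrative assumptions, not TextMonkey's actual procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def filter_redundant(tokens, threshold=0.95):
    """Greedily keep a token only if it is not a near-duplicate
    (cosine similarity >= threshold) of an already-kept token,
    shortening the token sequence while preserving distinct content."""
    kept = []
    for t in tokens:
        if all(cosine(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

# Three image tokens; the second is almost identical to the first.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(len(filter_redundant(tokens)))  # the near-duplicate is dropped
```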
https://arxiv.org/abs/2403.04473
In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (we refer to text in images as OCR-Text to distinguish it from text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers the diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we have designed a new label-generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at {this https URL}.
https://arxiv.org/abs/2403.00303
Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI. Existing datasets for developing ZSON algorithms lack consideration of dynamic obstacles, object attribute diversity, and scene texts, thus exhibiting a noticeable discrepancy from real-world situations. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) that comprises ten high-fidelity 3D scenes with over 18k tasks, aiming to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. Besides, different from existing datasets that only provide collision checking between the agent and static obstacles, we enhance DOZE by integrating capabilities for detecting collisions between the agent and moving obstacles. This novel functionality enables evaluation of the agents' collision avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches concerning navigation efficiency, safety, and object recognition accuracy. Our dataset can be found at this https URL.
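The added capability of checking collisions against moving obstacles reduces, in its simplest form, to a circle-circle test evaluated along the obstacle's trajectory. This sketch assumes constant-velocity obstacles and circular footprints; DOZE's actual collision model may differ.

```python
def collides(agent_pos, agent_r, obs_start, obs_vel, obs_r, t):
    """Circle-circle collision test against an obstacle moving at
    constant velocity, evaluated at time t (all quantities hypothetical)."""
    ox = obs_start[0] + obs_vel[0] * t
    oy = obs_start[1] + obs_vel[1] * t
    dx, dy = agent_pos[0] - ox, agent_pos[1] - oy
    # Compare squared distance against squared sum of radii (avoids sqrt).
    return dx * dx + dy * dy <= (agent_r + obs_r) ** 2

# A humanoid obstacle walking toward a stationary agent along the x-axis.
print(collides((0.0, 0.0), 0.3, (5.0, 0.0), (-1.0, 0.0), 0.3, t=1.0))  # still far: False
print(collides((0.0, 0.0), 0.3, (5.0, 0.0), (-1.0, 0.0), 0.3, t=4.6))  # overlap: True
```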
https://arxiv.org/abs/2402.19007
Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.
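Replacing one-hot targets with corpus-derived character distributions can be sketched as follows: the gold character keeps most of the probability mass, and the remainder is spread according to next-character statistics from a text corpus. The mixing weight and the tiny corpus are illustrative assumptions, not the paper's actual distribution-generation technique.

```python
from collections import Counter

def soft_target(corpus, prefix, gold_char, alpha=0.8,
                vocab="abcdefghijklmnopqrstuvwxyz"):
    """Soft training target for the character after `prefix`:
    alpha * one-hot(gold_char) + (1 - alpha) * empirical next-char
    distribution observed in `corpus` (a sketch of the idea)."""
    nexts = Counter(w[len(prefix)] for w in corpus
                    if w.startswith(prefix) and len(w) > len(prefix))
    total = sum(nexts.values())
    dist = {}
    for c in vocab:
        lm = nexts[c] / total if total else 0.0
        dist[c] = alpha * (1.0 if c == gold_char else 0.0) + (1 - alpha) * lm
    return dist

corpus = ["the", "then", "them", "this", "that"]
d = soft_target(corpus, "th", "e", alpha=0.8)
# 'e' dominates, but 'i' and 'a' keep small corpus-informed mass.
```

Unlike pure one-hot encoding, plausible alternative characters receive non-zero mass, giving the recognizer a smoother, more informative training signal.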
https://arxiv.org/abs/2402.17134
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training. However, collecting and labeling real text images is expensive and time-consuming, which limits the availability of real data. Therefore, most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models. To alleviate this problem, recent semi-supervised STR methods exploit unlabeled real data by enforcing character-level consistency regularization between weakly and strongly augmented views of the same image. However, these methods neglect word-level consistency, which is crucial for sequence recognition tasks. This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects. Specifically, we devise a shortest path alignment module to align the sequential visual features of different views and minimize their distance. Moreover, we adopt a reinforcement learning framework to optimize the semantic similarity of the predicted strings in the embedding space. We conduct extensive experiments on several standard and challenging STR benchmarks and demonstrate the superiority of our proposed method over existing semi-supervised STR methods.
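A shortest-path alignment between two feature sequences of different lengths can be sketched with the classic dynamic-programming recurrence. The paper aligns high-dimensional sequential visual features of weakly and strongly augmented views; this toy uses 1-D values.

```python
def dtw_distance(a, b):
    """Dynamic-programming shortest-path alignment cost between two
    1-D feature sequences (a stand-in for sequential visual features)."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

weak = [0.1, 0.5, 0.9]
strong = [0.1, 0.1, 0.5, 0.9]   # same word, one extra frame
print(dtw_distance(weak, strong))  # aligns perfectly despite length mismatch
```

Minimizing this alignment cost between the two views enforces word-level consistency even when augmentation changes the sequence length.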
https://arxiv.org/abs/2402.15806
Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text-style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module to incorporate the canonical mask guidance for further feature refinement for text recognition. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM outperforms the state-of-the-art method by an average of 4.1% across six more challenging datasets, despite its smaller model size. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement techniques for robust scene text recognition. The code is available at this https URL.
https://arxiv.org/abs/2402.13643
Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images based on personalized visual concepts from a limited number of user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when manipulating scenes defined by textual inputs. Addressing this, we introduce ComFusion, a novel approach that leverages pretrained models to compose a few user-provided subject images with predefined text scenes, effectively fusing visual subject instances with text-specific scenes and generating high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior-preservation regularization, which combines subject-class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images to ensure they align effectively with both the instance image and the scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.
https://arxiv.org/abs/2402.11849
Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well-suited for irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel at complex layouts but suffer with irregular shapes. To strengthen their merits and overcome their respective demerits, we propose a Complementary Proposal Network (CPN) that seamlessly integrates semantic and geometric information in parallel for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals employing an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals with pre-defined anchors. To further enhance the complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches by significant margins at comparable computation cost. Specifically, our approach achieves improvements of 3.6%, 1.3%, and 1.0% on the challenging benchmarks ICDAR19-ArT, IC15, and MSRA-TD500, respectively. Code for our method will be released.
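For reference, a plain (non-deformable) binary dilation, from the operator family the Deformable Morphology Semantic Network builds on, looks like this on a toy grid. The paper's deformable variant adapts where it grows rather than using a fixed structuring element as here.

```python
def dilate(mask, iterations=1):
    """Binary dilation with a fixed 4-connected structuring element:
    every foreground pixel grows into its vertical/horizontal neighbours."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [row[:] for row in mask]
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
        mask = out
    return mask

seed = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(dilate(seed))  # the centre pixel grows into a plus shape
```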
https://arxiv.org/abs/2402.11540
We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges, and discuss the system architecture, design choices, and modeling techniques employed to overcome these obstacles. We also provide a comprehensive evaluation for each component, showcasing high quality and efficiency.
https://arxiv.org/abs/2402.08017
Multi-modal models have shown appealing performance in visual tasks recently, as instruction-guided training has evoked the ability to understand fine-grained visual content. However, current methods cannot be trivially applied to scene text recognition (STR) due to the gap between natural and text images. In this paper, we introduce a novel paradigm that formulates STR as an instruction learning problem, and propose instruction-guided scene text recognition (IGTR) to achieve effective cross-modal learning. IGTR first generates rich and diverse instruction triplets of <condition,question,answer>, serving as guidance for nuanced text image understanding. Then, we devise an architecture with a dedicated cross-modal feature fusion module and a multi-task answer head to effectively fuse the required instruction and image features for answering questions. Built upon these designs, IGTR facilitates accurate text recognition by comprehending character attributes. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins. Furthermore, by adjusting the instructions, IGTR enables various recognition schemes. These include zero-shot prediction, where the model is trained on instructions that do not explicitly target character recognition, and the recognition of rarely appearing and morphologically similar characters, which were previously challenging for existing models.
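Generating <condition, question, answer> instruction triplets for a labelled text image can be sketched with a few templates; these template strings are hypothetical and much simpler than IGTR's rich instruction set.

```python
def make_triplets(word):
    """Build simple <condition, question, answer> triplets for a text
    image whose ground-truth transcription is `word` (toy templates)."""
    triplets = [
        ("recognise all", "What is the word?", word),
        ("count", "How many characters?", str(len(word))),
    ]
    # Per-character positional questions probe character attributes.
    for i, ch in enumerate(word):
        triplets.append(("position", f"Which character is at index {i}?", ch))
    return triplets

for cond, q, a in make_triplets("cat"):
    print(cond, "|", q, "|", a)
```

Training on many such triplets per image is what lets the model answer recognition questions it was never explicitly supervised on.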
https://arxiv.org/abs/2401.17851
Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulty restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each includes images derived from real-life and synthetic data, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, the Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by a thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: this https URL.
https://arxiv.org/abs/2401.14832
Recently, scene text detection has received significant attention due to its wide application. However, accurate detection in complex scenes with multiple scales, orientations, and curvatures remains a challenge. Numerous detection methods adopt the Vatti clipping (VC) algorithm for multiple-instance training to address arbitrary-shaped text. Yet we identify a bias introduced by these approaches, which we call the "shrinked kernel": a decrease in accuracy caused by outputs that overly favor the text kernel. In this paper, we propose a new approach named Expand Kernel Network (EK-Net), which uses an expand-kernel distance to compensate for this deficiency and a three-stage regression to complete instance detection. Moreover, EK-Net not only realizes precise positioning of arbitrary-shaped text but also achieves a trade-off between performance and speed. Evaluation results demonstrate that EK-Net achieves state-of-the-art or competitive performance compared to other advanced methods, e.g., an F-measure of 85.72% at 35.42 FPS on ICDAR 2015 and an F-measure of 85.75% at 40.13 FPS on CTW1500.
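The kernel-shrink offset commonly paired with Vatti clipping in segmentation-based detectors is D = A(1 - r^2) / L, where A is the polygon area, L its perimeter, and r the shrink ratio. A sketch (the ratio r = 0.4 is just an example value):

```python
import math

def shrink_offset(points, r=0.4):
    """Kernel shrink distance D = Area * (1 - r^2) / Perimeter, the rule
    commonly used to shrink a text polygon into its kernel before
    offsetting it with a Vatti-clipping library."""
    n = len(points)
    area = perim = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1      # shoelace term
        perim += math.hypot(x2 - x1, y2 - y1)
    area = abs(area) / 2.0
    return area * (1 - r * r) / perim

# An axis-aligned 100x20 text box: area 2000, perimeter 240.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
print(shrink_offset(box, r=0.4))  # 2000 * 0.84 / 240 = 7.0
```

Over-aggressive shrinking by this offset is exactly the "shrinked kernel" bias the abstract describes; EK-Net's expand-kernel distance works in the opposite direction.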
https://arxiv.org/abs/2401.11704
Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes. Although current state-of-the-art models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures comprised of visual encoders and sequence decoders. In this work, we propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR), which achieves an impressive balance between high performance and rapid inference speeds in the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by multiple self-attention layers, while eschewing the traditional sequence decoder. This design choice results in a lightweight and efficient model capable of handling inputs of varying sizes. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of VIPTR. Notably, the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which blends high accuracy with efficiency and greatly benefits real-world applications requiring fast and reliable text recognition. The code is publicly available at this https URL.
https://arxiv.org/abs/2401.10110
Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored in the process of semantic mining, which limits the performance of the algorithm in recognizing irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch. The position self-enhanced encoder provides character sequence position encoding for both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch carries out visual recognition based on the visual features extracted by CNN and the position encoding information provided by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way that human recognizes scene text and integrates cross-modal visual cues for text recognition. The experiments demonstrate that the proposed CMFN algorithm achieves comparable performance to state-of-the-art algorithms, indicating its effectiveness.
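A cross-modal fusion gate of the general form described above can be sketched as a sigmoid gate blending visual and semantic features. The scalar weights and bias here are made up for illustration, whereas CMFN learns its gate parameters end to end.

```python
import math

def fusion_gate(visual, semantic, w_v=0.6, w_s=0.4, bias=0.0):
    """Scalar-gated fusion sketch: g = sigmoid(w_v*mean(v) + w_s*mean(s) + b),
    fused = g * v + (1 - g) * s. A learned gate would replace the fixed
    weights and typically act per-dimension rather than as one scalar."""
    mv = sum(visual) / len(visual)
    ms = sum(semantic) / len(semantic)
    g = 1.0 / (1.0 + math.exp(-(w_v * mv + w_s * ms + bias)))
    return [g * v + (1 - g) * s for v, s in zip(visual, semantic)]

# Blend a visual feature with the iteratively refined semantic feature.
fused = fusion_gate([1.0, 0.0], [0.0, 1.0])
```

The gate lets the model lean on visual evidence for irregular text while still integrating the semantic branch's prediction.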
https://arxiv.org/abs/2401.10041