Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency of the framework between pre-training (if any) and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage masked vision-language context and achieve bi-modal feature interaction. The architecture of VL-Reader maintains consistency from pre-training to fine-tuning. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degenerates to reconstructing all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%, and the improvement is even more significant on challenging datasets. The results demonstrate that a vision-language reconstructor can serve as an effective scene text recognizer.
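A minimal sketch of what a masked visual-linguistic reconstruction objective of this kind can look like: a fraction of patch embeddings and character tokens is masked, a joint decoder processes both streams, and reconstruction losses are computed only on the masked positions. This is not the authors' implementation; module names, dimensions, and the shared mask token are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVLRSketch(nn.Module):
    def __init__(self, dim=256, vocab_size=97, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.char_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.joint_decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pixel_head = nn.Linear(dim, 16 * 16 * 3)   # reconstruct a flattened 16x16 RGB patch
        self.char_head = nn.Linear(dim, vocab_size)     # reconstruct masked characters

    def random_mask(self, tokens):
        b, n, _ = tokens.shape
        keep = torch.rand(b, n, device=tokens.device) > self.mask_ratio
        return torch.where(keep.unsqueeze(-1), tokens, self.mask_token.expand(b, n, -1)), keep

    def forward(self, patch_tokens, patch_targets, char_ids):
        # patch_tokens: (B, N, dim); patch_targets: (B, N, 768) flattened pixels; char_ids: (B, M) long
        vis, vis_keep = self.random_mask(patch_tokens)
        txt, txt_keep = self.random_mask(self.char_embed(char_ids))
        out = self.joint_decoder(torch.cat([vis, txt], dim=1))
        vis_out, txt_out = out[:, :vis.size(1)], out[:, vis.size(1):]
        # losses only on the masked positions, as in masked autoencoding
        pix_loss = F.mse_loss(self.pixel_head(vis_out)[~vis_keep], patch_targets[~vis_keep])
        chr_loss = F.cross_entropy(self.char_head(txt_out)[~txt_keep], char_ids[~txt_keep])
        return pix_loss + chr_loss
```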
https://arxiv.org/abs/2409.11656
The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speed while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released on our GitHub.
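The sketch below only shows the overall query-based spotter shape implied by the abstract — backbone features feed a Transformer encoder-decoder whose learned queries are read out as polygon points and transcriptions. The small convolutional stem stands in for the Swin backbone, and the SAC2 attention unit is not reproduced; all names and sizes are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

class QueryTextSpotter(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_points=16, vocab_size=97, max_len=25):
        super().__init__()
        self.backbone = nn.Sequential(              # placeholder for a Swin-style backbone (stride 16)
            nn.Conv2d(3, dim, kernel_size=8, stride=8), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=2, stride=2),
        )
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.queries = nn.Embedding(num_queries, dim)
        self.point_head = nn.Linear(dim, num_points * 2)        # polygon control points per query
        self.text_head = nn.Linear(dim, max_len * vocab_size)   # per-instance transcription logits
        self.max_len, self.vocab_size = max_len, vocab_size

    def forward(self, images):                                  # (B, 3, H, W), H and W divisible by 16
        feat = self.backbone(images)                            # (B, C, H/16, W/16)
        tokens = feat.flatten(2).transpose(1, 2)                # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src=tokens, tgt=q)                # (B, num_queries, C)
        points = self.point_head(hs).sigmoid()                  # normalized coordinates
        text_logits = self.text_head(hs).view(*hs.shape[:2], self.max_len, self.vocab_size)
        return points, text_logits
```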
https://arxiv.org/abs/2408.14998
Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models were developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition and mathematical expression recognition). However, such specialist models usually cannot effectively generalize across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on tremendous amounts of data in a unified way, have shown enormous potential in reading text in various scenarios, but with the drawbacks of limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: it is able to recognize text of various forms with a single unified architecture, while achieving excellent accuracy and high efficiency. To better exploit the advantage of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at this https URL.
https://arxiv.org/abs/2408.14805
Click-Through Rate (CTR) prediction is crucial for Recommendation Systems (RS), aiming to provide personalized recommendation services for users in many scenarios such as food delivery and e-commerce. However, traditional RS relies on collaborative signals, which lack semantic understanding of real-time scenes. We also note that a major challenge in utilizing Large Language Models (LLMs) for practical recommendation purposes is their efficiency in dealing with long text input. To break through these problems, we propose Large Language Model Aided Real-time Scene Recommendation (LARR), which adopts LLMs for semantic understanding and utilizes real-time scene information in the RS without requiring the LLM to process the entire real-time scene text directly, thereby enhancing the efficiency of LLM-based CTR modeling. Specifically, recommendation domain-specific knowledge is injected into the LLM, and the RS then employs an aggregation encoder to build real-time scene information from the LLM's separate outputs. First, the LLM is continually pre-trained on a corpus built from recommendation data with the aid of special tokens. Subsequently, the LLM is fine-tuned via contrastive learning with three kinds of sample construction strategies. Through this step, the LLM is transformed into a text embedding model. Finally, the LLM's separate outputs for different scene features are aggregated by an encoder and aligned with the collaborative signals in the RS, enhancing the performance of the recommendation model.
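An illustrative sketch of the aggregation and alignment step described above: frozen per-feature text embeddings from an LLM are fused by a small encoder into one scene vector, which is aligned with the recommender's collaborative embedding via an InfoNCE-style contrastive loss. All module names and dimensions are assumptions for illustration, not the LARR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneAggregator(nn.Module):
    def __init__(self, llm_dim=4096, dim=128, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(llm_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feature_embs):                 # (B, num_scene_features, llm_dim)
        x = self.encoder(self.proj(feature_embs))
        return F.normalize(x.mean(dim=1), dim=-1)    # one scene vector per request

def align_loss(scene_vec, collab_vec, temperature=0.07):
    """InfoNCE over the batch: each scene vector should match its own collaborative signal."""
    logits = scene_vec @ F.normalize(collab_vec, dim=-1).t() / temperature
    targets = torch.arange(scene_vec.size(0), device=scene_vec.device)
    return F.cross_entropy(logits, targets)
```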
https://arxiv.org/abs/2408.11523
Scene text editing aims to modify text on images while keeping the style of the newly generated text similar to that of the original. Given an image, a target area, and target text, the task produces an output image with the target text in the selected area, replacing the original. This task has been studied extensively, with initial success using Generative Adversarial Networks (GANs) to balance text fidelity and style similarity. However, GAN-based methods struggled with complex backgrounds or text styles. Recent works leverage diffusion models, showing improved results, yet still face challenges, especially with non-Latin languages like CJK characters (Chinese, Japanese, Korean) that have complex glyphs, often producing inaccurate or unrecognizable characters. To address these issues, we present \emph{TextMastero} - a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs). TextMastero introduces two key modules: a glyph conditioning module for fine-grained content control in generating accurate texts, and a latent guidance module for providing comprehensive style information to ensure similarity before and after editing. Both qualitative and quantitative experiments demonstrate that our method surpasses all known existing works in text fidelity and style similarity.
https://arxiv.org/abs/2408.10623
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images poses a challenge in acquiring feature representations that align well with real scene images, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. We further introduce an Offline Randomized Perturbation (ORP) strategy, which enriches the diversity of text embeddings by incorporating natural image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire the potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings to focus on the character foreground within the text image, thereby enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training. Code is available at this https URL
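A minimal sketch of the offline-perturbation idea described above: a CLIP text embedding used as a pseudo visual embedding is mixed with a randomly drawn natural-image embedding from a pre-extracted offline bank. The mixing scheme and names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def orp_pseudo_visual(text_emb: torch.Tensor,
                      image_emb_bank: torch.Tensor,
                      max_alpha: float = 0.3) -> torch.Tensor:
    """text_emb: (B, D) CLIP text embeddings; image_emb_bank: (N, D) pre-extracted
    CLIP image embeddings of unrelated natural images."""
    idx = torch.randint(0, image_emb_bank.size(0), (text_emb.size(0),), device=text_emb.device)
    noise = image_emb_bank[idx]                                          # (B, D)
    alpha = torch.rand(text_emb.size(0), 1, device=text_emb.device) * max_alpha
    pseudo = (1 - alpha) * text_emb + alpha * noise                      # perturbed pseudo visual embedding
    return F.normalize(pseudo, dim=-1)
```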
https://arxiv.org/abs/2408.05706
In recent years, significant progress has been made in scene text recognition by data-driven methods. However, due to the scarcity of annotated real-world data, the training of these methods predominantly relies on synthetic data. The distribution gap between synthetic and real data constrains the further performance improvement of these methods in real-world applications. To tackle this problem, a highly promising approach is to utilize massive amounts of unlabeled real data for self-supervised training, which has been widely proven effective in many NLP and CV tasks. Nevertheless, generic self-supervised methods are unsuitable for scene text images due to their sequential nature. To address this issue, we propose a Local Explicit and Global Order-aware self-supervised representation learning method (LEGO) that accounts for the characteristics of scene text images. Inspired by the human cognitive process of learning words, which involves spelling, reading, and writing, we propose three novel pre-text tasks for LEGO to model sequential, semantic, and structural features, respectively. The entire pre-training process is optimized by using a consistent Text Knowledge Codebook. Extensive experiments validate that LEGO outperforms previous scene text self-supervised methods. The recognizer incorporating our pre-trained model achieves superior or comparable performance compared to state-of-the-art scene text recognition methods on six benchmarks. Furthermore, we demonstrate that LEGO can achieve superior performance in other text-related tasks.
https://arxiv.org/abs/2408.02036
Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content words and function words for processing, in which a semantic-aware prompting scheme and a distracted queries assistance module are utilized. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% while running 4 times faster. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be publicly available at this https URL.
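A toy illustration of the content-word / function-word split mentioned above: function words are matched against a small stoplist and given a different prompt template than content words before CLIP-style scoring. The stoplist and templates are assumptions; the paper's semantic-aware prompting scheme is more involved.

```python
# Hypothetical stoplist; real systems would use a proper function-word list.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "or", "with"}

def split_query(query: str):
    tokens = query.lower().split()
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    function = [t for t in tokens if t in FUNCTION_WORDS]
    return content, function

def build_prompts(query: str):
    content, function = split_query(query)
    prompts = [f'a photo containing the scene text "{w}"' for w in content]
    prompts += [f'a photo where the word "{w}" appears as scene text' for w in function]
    return prompts

# Example: build_prompts("house of pizza") yields one prompt per word, so a
# CLIP-style retriever can score content and function words with different templates.
```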
https://arxiv.org/abs/2408.00441
An increasing number of end-to-end text spotting methods based on the Transformer architecture have demonstrated superior performance. These methods utilize a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and actual objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby affecting the training performance of the model. Existing literature applies denoising training to solve the problem of bipartite graph matching instability in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting tasks, as these tasks involve irregular shape detection and text recognition that is more complex than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, considering that outputting the text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize the noised content queries, thereby assisting in the alignment of text content and position. To improve the model's perception of the background, we further utilize an additional loss function for background character classification in the denoising training part. Although DNTextSpotter is conceptually simple, it outperforms state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), notably yielding an improvement of 11.3% over the best approach on the Inverse-Text dataset.
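A sketch of how noised positional queries could be formed from Bezier control points, as described above: the ground-truth control points of the Bezier center curve are jittered with Gaussian noise and embedded as query positions. The noise scale and embedding are illustrative assumptions, not the DNTextSpotter implementation.

```python
import torch
import torch.nn as nn

class NoisedPositionalQueries(nn.Module):
    def __init__(self, dim=256, num_ctrl_points=4, noise_scale=0.02):
        super().__init__()
        self.noise_scale = noise_scale
        self.embed = nn.Linear(num_ctrl_points * 2, dim)

    def forward(self, ctrl_points):              # (B, num_instances, 4, 2), coordinates normalized to [0, 1]
        noise = torch.randn_like(ctrl_points) * self.noise_scale
        noised = (ctrl_points + noise).clamp(0.0, 1.0)
        return self.embed(noised.flatten(2))     # (B, num_instances, dim) positional queries
```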
https://arxiv.org/abs/2408.00355
Accurate text segmentation results are crucial for text-related generative tasks, such as text image generation, text editing, text removal, and text style transfer. Recently, some scene text segmentation methods have made significant progress in segmenting regular text. However, these methods perform poorly in scenarios containing artistic text. Therefore, this paper focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset. One challenge of the task is that the local stroke shapes of artistic text vary greatly in diversity and complexity. We propose a decoder with a layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes. Another challenge is the complexity of the global topological structure. We further design a skeleton-assisted head to guide the model to focus on the global structure. Additionally, to enhance the generalization performance of the text segmentation model, we propose a strategy for training data synthesis, based on the large multi-modal model and the diffusion model. Experimental results show that our proposed method and synthetic dataset can significantly enhance the performance of artistic text segmentation and achieve state-of-the-art results on other public datasets.
https://arxiv.org/abs/2408.00106
The rapid advancements of generative AI have fueled the potential of generative text image editing while simultaneously escalating the threat of misinformation spreading. However, existing forensics methods struggle to detect unseen forgery types that they have not been trained on, leaving the development of a model capable of generalized detection of tampered scene text as an unresolved issue. To tackle this, we propose a novel task: open-set tampered scene text detection, which evaluates forensics models on their ability to identify both seen and previously unseen forgery types. We have curated a comprehensive, high-quality dataset, featuring the texts tampered by eight text editing models, to thoroughly assess the open-set generalization capabilities. Further, we introduce a novel and effective pre-training paradigm that subtly alters the texture of selected texts within an image and trains the model to identify these regions. This approach not only mitigates the scarcity of high-quality training data but also enhances models' fine-grained perception and open-set generalization abilities. Additionally, we present DAF, a novel framework that improves open-set generalization by distinguishing between the features of authentic and tampered text, rather than focusing solely on the tampered text's features. Our extensive experiments validate the remarkable efficacy of our methods. For example, our zero-shot performance can even beat the previous state-of-the-art full-shot model by a large margin. Our dataset and code will be open-source.
https://arxiv.org/abs/2407.21422
Transcription-only Supervised Text Spotting aims to learn text spotters relying only on transcriptions but no text boundaries for supervision, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The detected anchor points by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.
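A simplified sketch of the weakly supervised anchor-point idea described above: a transcription embedding is correlated with every spatial location of an image feature map, the best-matching location serves as a pseudo anchor, and an image-text contrastive loss over the batch supervises the matching scores. Shapes, pooling choices, and the single-embedding-per-transcription setup are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def anchor_and_contrast(img_feats, text_embs, temperature=0.07):
    """img_feats: (B, C, H, W) visual features; text_embs: (B, C), one embedding
    per transcription, paired with the image at the same batch index."""
    B, C, H, W = img_feats.shape
    v = F.normalize(img_feats, dim=1)                           # (B, C, H, W)
    t = F.normalize(text_embs, dim=-1)                          # (B, C)
    sim = torch.einsum("bc,nchw->bnhw", t, v)                   # similarity of each text to each image location
    scores = sim.flatten(2).max(dim=-1).values                  # (B, B): best location per (text, image) pair
    diag = sim[torch.arange(B), torch.arange(B)]                # (B, H, W): matched text-image pairs
    anchors = diag.flatten(1).argmax(dim=-1)                    # flat index of the pseudo anchor point
    anchor_xy = torch.stack((anchors % W, torch.div(anchors, W, rounding_mode="floor")), dim=-1)
    loss = F.cross_entropy(scores / temperature, torch.arange(B, device=img_feats.device))
    return anchor_xy, loss
```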
https://arxiv.org/abs/2407.19507
Open-set text recognition, which aims to address both novel characters and previously seen ones, is one of the rising subtopics in the text recognition field. However, current open-set text recognition solutions focus only on horizontal text and fail to model the real-life challenges posed by the variety of writing directions in real-world scene text. Multi-orientation text recognition, in general, faces challenges from diverse image aspect ratios, significant imbalance in data amount, and domain gaps between orientations. In this work, we first propose a Multi-Oriented Open-Set Text Recognition task (MOOSTR) to model the challenges of both novel characters and writing-direction variety. We then propose a Multi-Orientation Sharing Experts (MOoSE) framework as a strong baseline solution. MOoSE uses a mixture-of-experts scheme to alleviate the domain gaps between orientations, while exploiting common structural knowledge among experts to alleviate the data scarcity that some experts face. The proposed MOoSE framework is validated by ablative experiments, and also tested for feasibility on the existing open-set benchmark. Code, models, and documents are available at: this https URL
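A rough sketch of an orientation-gated mixture-of-experts recognizer in the spirit of the framework described above: a shared layer carries common structural knowledge, per-orientation experts specialize, and a lightweight gate mixes their outputs. The gating and expert structure here are illustrative assumptions, not the released MOoSE code.

```python
import torch
import torch.nn as nn

class OrientationMoE(nn.Module):
    def __init__(self, dim=256, num_experts=3, num_classes=7000):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.GELU())      # shared structural knowledge
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_classes))
            for _ in range(num_experts)                                  # e.g. horizontal / vertical / rotated
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, feats):                       # (B, T, dim) frame-level visual features
        h = self.shared(feats)
        weights = self.gate(h.mean(dim=1)).softmax(dim=-1)               # (B, num_experts) gating weights
        expert_logits = torch.stack([e(h) for e in self.experts], dim=1) # (B, E, T, num_classes)
        return (weights[:, :, None, None] * expert_logits).sum(dim=1)    # gated per-frame logits
```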
https://arxiv.org/abs/2407.18616
Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training.
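A minimal sketch of the "detect edges, then drop non-text edges" step described above, using a classical Canny detector and a coarse text-region mask. The thresholds are arbitrary, and the paper's learned text edge extractor and edge-guided encoder are not reproduced here.

```python
import cv2
import numpy as np

def text_edge_map(image_bgr: np.ndarray, coarse_text_mask: np.ndarray) -> np.ndarray:
    """image_bgr: HxWx3 uint8 image; coarse_text_mask: HxW {0,1} mask of likely text regions."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                       # all edges in the image
    dilated_mask = cv2.dilate(coarse_text_mask.astype(np.uint8),
                              np.ones((5, 5), np.uint8), iterations=1)
    return edges * dilated_mask                             # keep only edges near text regions
```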
https://arxiv.org/abs/2407.17020
Image inpainting aims to fill missing pixels in damaged images and has achieved significant progress with cutting-edge learning techniques. Nevertheless, state-of-the-art inpainting methods are mainly designed for natural images and cannot correctly recover text within scene text images, and training existing models on scene text images cannot fix the issues. In this work, we identify the visual-text inpainting task to achieve high-quality scene text image restoration and text completion: given a scene text image with unknown missing regions and the corresponding text with unknown missing characters, we aim to complete the missing information in both the image and the text by leveraging their complementary information. Intuitively, the input text, even if damaged, contains language priors for the contents within the image and can guide the image inpainting. Meanwhile, the scene text image includes appearance cues of the characters that can benefit text recovery. To this end, we design the cross-modal predictive interaction (CLII) model containing two branches, i.e., ImgBranch and TxtBranch, for scene text inpainting and text completion, respectively, while effectively leveraging their complementarity. Moreover, we propose to embed our model into the SOTA scene text spotting method and significantly enhance its robustness against missing pixels, which demonstrates the practicality of the newly developed task. To validate the effectiveness of our method, we construct three real datasets based on existing text-related datasets, containing 1838 images and covering three scenarios with curved, incidental, and styled texts, and conduct extensive experiments to show that our method outperforms baselines significantly.
https://arxiv.org/abs/2407.16204
Video generation has witnessed great success recently, but its application to generating long videos remains challenging due to the difficulty of maintaining the temporal consistency of generated videos and the high memory cost during generation. To tackle these problems, in this paper we propose a new idea, Multi-sentence Video Grounding for Long Video Generation, connecting massive video moment retrieval to the video generation task for the first time and providing a new paradigm for long video generation. Our method can be summarized in three steps: (i) We design sequential scene text prompts as the queries for video grounding, utilizing massive video moment retrieval to search for video moment segments that meet the text requirements in the video database. (ii) Based on the source frames of the retrieved video moment segments, we adopt video editing methods to create new video content while preserving the temporal consistency of the retrieved video. Since the editing can be conducted segment by segment, and even frame by frame, it largely reduces the memory cost. (iii) We also explore video morphing and personalized generation methods to improve the subject consistency of long video generation, providing ablation results for the subtasks of long video generation. Our approach seamlessly extends developments in image/video editing, video morphing, personalized generation, and video grounding to long video generation, offering effective solutions for generating long videos at low memory cost.
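A high-level sketch of the three-step pipeline described above. Every helper (`retrieve_moments`, `edit_segment`, `blend_transition`) is a hypothetical placeholder standing in for a video-moment-retrieval model, a video-editing model, and a morphing/blending step; the point is only to show how segment-by-segment processing keeps memory bounded.

```python
def generate_long_video(scene_prompts, retrieve_moments, edit_segment, blend_transition):
    """scene_prompts: ordered scene text prompts; the three callables are hypothetical
    stand-ins for retrieval, editing, and morphing components."""
    output = []
    for prompt in scene_prompts:
        moment = retrieve_moments(prompt)[0]        # (i) ground the sentence in the video database
        edited = edit_segment(moment, prompt)       # (ii) edit only this segment, so memory stays per-segment
        if output:
            edited = blend_transition(output[-1], edited)   # (iii) smooth the boundary for subject consistency
        output.append(edited)
    return output
```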
https://arxiv.org/abs/2407.13219
Scene Text Recognition (STR) methods have demonstrated robust performance in word-level text recognition. However, in applications the text image is sometimes long, because it is detected with multiple horizontal words. This triggers the requirement to build long-text recognition models from readily available short, word-level text datasets, which has been studied less previously. In this paper, we term this Out of Length (OOL) text recognition. We establish a new Long Text Benchmark (LTB) to facilitate the assessment of different methods in long text recognition. Meanwhile, we propose a novel method called OOL Text Recognition with sub-String Matching (SMTR). SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features, matching the sub-string and simultaneously recognizing its next and previous characters. SMTR can recognize text of arbitrary length by iterating this process. To avoid being trapped in recognizing highly similar sub-strings, we introduce a regularization training to compel SMTR to effectively discover subtle differences between similar sub-strings for precise matching. In addition, we propose an inference augmentation to alleviate confusion caused by identical sub-strings and improve the overall recognition efficiency. Extensive experimental results reveal that SMTR, even when trained exclusively on short text, outperforms existing methods on public short-text benchmarks and exhibits a clear advantage on LTB. Code: \url{this https URL}.
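An illustrative decoding loop for the sub-string matching idea described above, in the forward direction only: the most recent sub-string is encoded into a "next" query, cross-attention over the image features predicts the following character, and the window slides until an end token appears. `model` is a hypothetical module exposing `encode_substring` and `attend_next`; lengths and token ids are assumptions.

```python
import torch

@torch.no_grad()
def smtr_style_decode(model, img_feats, start_substr, eos_id, max_len=200):
    """img_feats: (1, HW, C) image features; start_substr: list of initial character ids."""
    chars = list(start_substr)
    for _ in range(max_len):
        window = chars[-len(start_substr):]                          # most recent fixed-length sub-string
        next_query = model.encode_substring(torch.tensor([window]))  # hypothetical: (1, 1, C) "next" query
        logits = model.attend_next(next_query, img_feats)            # hypothetical: (1, vocab) next-char logits
        next_id = int(logits.argmax(dim=-1))
        if next_id == eos_id:
            break
        chars.append(next_id)
    return chars
```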
https://arxiv.org/abs/2407.12317
Recently, scene text recognition (STR) models have shown significant performance improvements. However, existing models still encounter difficulties in recognizing challenging texts that involve factors such as severely distorted and perspective characters. These challenging texts mainly cause two problems: (1) large intra-class variance and (2) small inter-class variance. An extremely distorted character may differ prominently in appearance from other characters within the same category, while the variance between characters from different classes is relatively small. To address the above issues, we propose a novel method that enriches the character features to enhance the discriminability of characters. Firstly, we propose the Character-Aware Constraint Encoder (CACE), built from multiple stacked blocks. CACE introduces a decay matrix in each block to explicitly guide the attention region for each token. By continuously employing the decay matrix, CACE enables tokens to perceive morphological information at the character level. Secondly, an Intra-Inter Consistency Loss (I^2CL) is introduced to account for intra-class compactness and inter-class separability in feature space. I^2CL improves the discriminative capability of features by learning a long-term memory unit for each character category. Trained with synthetic data, our model achieves state-of-the-art performance on common benchmarks (94.1% accuracy) and the Union14M-Benchmark (61.6% accuracy). Code is available at this https URL.
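One plausible reading of the decay matrix described above, sketched below: a fixed bias added to the attention logits that decays with the distance between token positions, steering each token toward its local character region. The exponential form and scale are assumptions; the paper's exact matrix may differ.

```python
import torch
import torch.nn as nn

class DecayedSelfAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, decay=0.8, max_len=64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)
        idx = torch.arange(max_len)
        dist = (idx[None, :] - idx[:, None]).abs().float()
        # bias = log(decay^dist): 0 on the diagonal, increasingly negative with distance
        self.register_buffer("decay_bias", torch.log(decay ** dist + 1e-6))

    def forward(self, x):                                    # (B, T, dim), T <= max_len
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn + self.decay_bias[:T, :T]                # steer attention toward nearby tokens
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
```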
https://arxiv.org/abs/2407.05562
The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted. Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at this https URL.
https://arxiv.org/abs/2406.02461
While diffusion models have significantly advanced the quality of image generation, their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene text generation are typically limited by their reliance on an intermediate layout output. This dependency often results in a constrained diversity of text styles and fonts, an inherent limitation stemming from the deterministic nature of the layout generation phase. To address these challenges, this paper introduces SceneTextGen, a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage. By doing so, SceneTextGen facilitates a more natural and varied representation of text. The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties, coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies. We validate the performance of our method by demonstrating improved character recognition rates on generated images across different public visual text datasets, in comparison to both standard diffusion-based methods and text-specific methods.
https://arxiv.org/abs/2406.01062