Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, semantic information and visual structure, affect recognition performance significantly. To mitigate the effects of these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, a diffusion-based module is developed to enhance the text prior, offering better guidance for the SR network to generate SR images with higher semantic accuracy. Meanwhile, PEAN leverages an attention-based modulation module to understand scene text images by perceiving the local and global dependencies of images, regardless of the shape of the text. A multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new state-of-the-art (SOTA) results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code will be made available at this https URL.
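To make the multi-task learning paradigm above concrete, the sketch below combines a pixel-level SR loss with a recognition loss on the SR output; the function name, shapes, and weights are assumptions for illustration, not PEAN's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_sr_loss(sr_img, hr_img, rec_logits, gt_chars, w_pix=1.0, w_rec=0.1):
    """Hypothetical multi-task objective for STISR training.

    sr_img, hr_img : (B, C, H, W) super-resolved and ground-truth HR images.
    rec_logits     : (B, T, num_classes) recognizer predictions on sr_img.
    gt_chars       : (B, T) ground-truth character indices, padded with -100.
    """
    l_pix = F.l1_loss(sr_img, hr_img)            # image fidelity
    l_rec = F.cross_entropy(                     # semantic legibility
        rec_logits.reshape(-1, rec_logits.size(-1)),
        gt_chars.reshape(-1),
        ignore_index=-100,
    )
    return w_pix * l_pix + w_rec * l_rec
```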
https://arxiv.org/abs/2311.17955
Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.
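As a rough illustration of the blending idea (not the authors' DiffText pipeline), an off-the-shelf inpainting diffusion model can already fuse a text-bearing prompt into a masked background region; the model id, file names, and prompt below are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Generic inpainting model as a stand-in; DiffText adds its own strategies for
# spelling-faithful, visually coherent text.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

background = Image.open("street_scene.jpg").convert("RGB").resize((512, 512))
text_mask = Image.open("text_region_mask.png").convert("L").resize((512, 512))

# Inpaint the masked region with a text-bearing prompt so the synthesized word
# inherits the background's lighting and texture instead of looking pasted on.
result = pipe(
    prompt='a shop sign with the word "COFFEE" painted on it',
    image=background,
    mask_image=text_mask,
).images[0]
result.save("synthetic_text_image.png")
```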
https://arxiv.org/abs/2311.16555
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images, consequently elevating recognition accuracy in Scene Text Recognition (STR). Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance to address this issue. Nevertheless, they remain deficient when confronted with severely blurred images, due to their insufficient generation capability when little structural or semantic information can be extracted from the original images. Therefore, we introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising Network to guide the diffusion model in generating LR-consistent results through succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the superiority of RGDiffSR over prior state-of-the-art methods in both text recognition accuracy and image fidelity.
https://arxiv.org/abs/2311.13317
Retrieving textual information from natural scene images is an active research area in the field of computer vision with numerous practical applications. Detecting text regions and extracting text from signboards is a challenging problem due to special characteristics like reflecting lights, uneven illumination, or shadows found in real-life natural scene images. With the advent of deep learning-based methods, different sophisticated techniques have been proposed for text detection and text recognition from natural scenes. Though a significant amount of effort has been devoted to extracting natural scene text for resource-rich languages like English, little has been done for low-resource languages like Bangla. In this research work, we have proposed an end-to-end system with deep learning-based models for efficiently detecting, recognizing, correcting, and parsing address information from Bangla signboards. We have created manually annotated datasets and synthetic datasets to train signboard detection, address text detection, address text recognition, address text correction, and address text parser models. We have conducted a comparative study among different CTC-based and encoder-decoder model architectures for Bangla address text recognition. Moreover, we have designed a novel address text correction model using a sequence-to-sequence transformer-based network to improve the performance of the Bangla address text recognition model through post-correction. Finally, we have developed a Bangla address text parser using a state-of-the-art transformer-based pre-trained language model.
https://arxiv.org/abs/2311.13222
Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL). Nevertheless, applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover, our pilot experiments on LLMs show that ICL fails in STR, mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end, we introduce E$^2$STR, a STR model trained with context-rich scene text sequences, where the sequences are generated via our proposed in-context training strategy. E$^2$STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. Extensive experiments show that E$^2$STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks.
https://arxiv.org/abs/2311.13120
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred and noisy low-resolution (LR) text images in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we leverage text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, for STISR tasks. Our experimental results revealed that text-conditional DMs notably surpass existing STISR methods. Especially when texts from LR text images are given as input, the text-conditional DMs are able to produce superior quality super-resolution text images. Utilizing this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. This framework consists of three specialized text-conditional DMs, each dedicated to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images, which are more suitable for training STISR methods. Our experiments confirmed that these synthesized image pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.
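The data flow of such a pair-synthesis framework might look like the sketch below, chaining three text-conditional diffusion modules; the module names and interfaces are hypothetical placeholders rather than the paper's actual components.

```python
def synthesize_training_pair(text, synth_dm, sr_dm, degrade_dm):
    """Sketch of generating one LR-HR pair with three text-conditional
    diffusion modules (synthesis, super-resolution, degradation)."""
    base = synth_dm.sample(text=text)              # synthesize a text image
    hr = sr_dm.sample(text=text, image=base)       # clean high-resolution target
    lr = degrade_dm.sample(text=text, image=hr)    # realistically degraded LR input
    return lr, hr
```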
https://arxiv.org/abs/2311.09759
Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, images generated by Stable Diffusion models suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation, e.g., scene text editing. As a desired result, the model must be able to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce the Diffusion-BasEd Scene Text manipulation Network, DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies for each granularity to analyze our performance gain. We also demonstrate the effectiveness of our proposed method in synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% on the COCO-text and ICDAR2013 datasets for character-level evaluation.
https://arxiv.org/abs/2311.00734
This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks. Based on these observations, we delve deeper into the necessity of specialized OCR models and deliberate on the strategies to fully harness pretrained general-purpose LMMs like GPT-4V for downstream OCR tasks. The study offers a critical reference for future research in OCR with LMMs. The evaluation pipeline and results are available at this https URL.
https://arxiv.org/abs/2310.16809
Scene Text Editing (STE) aims to substitute text in an image with new desired text while preserving the background and styles of the original text. However, existing techniques still struggle to generate edited text images with a high degree of clarity and legibility. This challenge primarily stems from the inherent diversity of text types and the intricate textures of complex backgrounds. To address this challenge, this paper introduces a three-stage framework for transferring texts across text images. First, we introduce a text-swapping network that seamlessly substitutes the original text with the desired replacement. Next, we incorporate a background inpainting network into our framework. This specialized network reconstructs the background image, filling the voids left after the removal of the original text while preserving visual harmony and coherence. Finally, the outputs of the text-swapping network and the background inpainting network are combined through a fusion network to produce the final edited image. A demo video is included in the supplementary material.
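A minimal sketch of how the three stages could be composed at inference time; the module names and the channel-concatenation fusion are assumptions for illustration, not the paper's exact networks.

```python
import torch

def edit_scene_text(src_img, target_text, swap_net, inpaint_net, fusion_net):
    """Three-stage scene text editing sketch (placeholder modules).

    src_img     : (B, 3, H, W) source image containing the original text.
    target_text : encoded target string to render in the source style.
    """
    swapped_fg = swap_net(src_img, target_text)   # stage 1: target text, source style
    clean_bg = inpaint_net(src_img)               # stage 2: background with text removed
    edited = fusion_net(torch.cat([swapped_fg, clean_bg], dim=1))  # stage 3: fuse
    return edited
```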
https://arxiv.org/abs/2310.13366
Diffusion-based methods have achieved prominent success in generating 2D media. However, accomplishing similar proficiency for scene-level mesh texturing in 3D spatial applications, e.g., XR/VR, remains constrained, primarily due to the intricate nature of 3D geometry and the necessity for immersive free-viewpoint rendering. In this paper, we propose a novel indoor scene texturing framework, which delivers text-driven texture generation with enchanting details and authentic spatial coherence. The key insight is to first imagine a stylized 360° panoramic texture from the central viewpoint of the scene, and then propagate it to the rest of the scene with inpainting and imitating techniques. To ensure textures that are meaningful and aligned to the scene, we develop a novel coarse-to-fine panoramic texture generation approach with dual texture alignment, which considers both the geometry and texture cues of the captured scenes. To cope with cluttered geometries during texture propagation, we design a separated strategy, which conducts texture inpainting in confident regions and then learns an implicit imitating network to synthesize textures in occluded and tiny structural areas. Extensive experiments and an immersive VR application on real-world indoor scenes demonstrate the high quality of the generated textures and the engaging experience on VR headsets. Project webpage: this https URL
https://arxiv.org/abs/2310.13119
Explainable AI (XAI) is the study of how humans can understand the cause of a model's prediction. In this work, the problem of interest is Scene Text Recognition (STR) explainability, using XAI to understand the cause of an STR model's prediction. The recent XAI literature on STR only provides a simple analysis and does not fully explore other XAI methods. In this study, we specifically work on data explainability frameworks, called attribution-based methods, that explain the important parts of input data in deep learning models. However, integrating them into STR produces inconsistent and ineffective explanations, because they only explain the model in the global context. To solve this problem, we propose a new method, STRExp, that takes into consideration the local explanations, i.e., the individual character prediction explanations. STRExp is then benchmarked across different attribution-based methods on different STR datasets and evaluated across different STR models.
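For concreteness, a simple per-character attribution baseline in the spirit of the local explanations discussed above is sketched below using gradient×input saliency; this is a generic baseline with assumed tensor shapes, not the STRExp method itself.

```python
import torch

def per_character_saliency(recognizer, image, char_position):
    """Gradient×input saliency for a single decoded character.

    recognizer    : model mapping (1, C, H, W) -> (1, T, num_classes) logits.
    image         : (1, C, H, W) input text image.
    char_position : index t of the character prediction being explained.
    """
    image = image.detach().clone().requires_grad_(True)
    logits = recognizer(image)
    top_class = logits[0, char_position].argmax()
    logits[0, char_position, top_class].backward()
    return (image.grad * image).abs().sum(dim=1)  # (1, H, W) saliency map
```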
https://arxiv.org/abs/2310.09549
In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at this https URL.
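One plausible reading of the second-order statistics in the LCL is matching token-level Gram matrices between student and teacher features; the sketch below reflects that assumption and is not the paper's exact loss definition.

```python
import torch
import torch.nn.functional as F

def linguistic_consistency_loss(student_feat, teacher_feat):
    """Match first- and second-order statistics of (B, L, D) feature sequences."""
    l_first = F.mse_loss(student_feat, teacher_feat)      # first-order agreement
    gram_s = student_feat @ student_feat.transpose(1, 2)  # (B, L, L) token Gram matrix
    gram_t = teacher_feat @ teacher_feat.transpose(1, 2)
    l_second = F.mse_loss(gram_s, gram_t)                 # second-order agreement
    return l_first + l_second
```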
https://arxiv.org/abs/2310.04999
Text-driven 3D indoor scene generation could be useful for gaming, film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which is able to generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. To this end, our proposed method consists of two stages, a "Layout Generation Stage" and an "Appearance Generation Stage". The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. In this way, we achieve a high-quality 3D room with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive editing-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.
https://arxiv.org/abs/2310.03602
The adaptation capability to a wide range of domains is crucial for scene text spotting models deployed to real-world conditions. However, existing state-of-the-art (SOTA) approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR that focuses on solving scene-text spotting for both regular and arbitrary-shaped scene text, along with an exhaustive evaluation. The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains (e.g., language, synth-to-real, and documents), both in terms of accuracy and efficiency.
https://arxiv.org/abs/2310.00917
When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present to the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution-based end-to-end transformer baseline called DA-TextSpotter, which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code, and pre-trained models will be released upon acceptance.
https://arxiv.org/abs/2310.00558
The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the computational budget, etc. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
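The flipped-image CLIP score mentioned above can be sketched with an off-the-shelf CLIP model: horizontally flipping the image scrambles any rendered text, so the score reflects visual content rather than the caption literally appearing as scene text. The model id and function below are illustrative assumptions, not the competition code.

```python
import torch
from PIL import Image, ImageOps
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def flipped_clip_score(image_path, caption):
    """CLIP similarity between a caption and the horizontally flipped image."""
    image = ImageOps.mirror(Image.open(image_path).convert("RGB"))  # flip defeats readable text
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()  # cosine similarity in [-1, 1]
```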
https://arxiv.org/abs/2309.15954
Inspired by the great success of language model (LM)-based pre-training, recent studies in visual document understanding have explored LM-based pre-training methods for modeling text within document images. Among them, pre-training that reads all text from an image has shown promise, but often exhibits instability and even fails when applied to broader domains, such as those involving both visual documents and scene text images. This is a substantial limitation for real-world scenarios, where the processing of text image inputs in diverse domains is essential. In this paper, we investigate effective pre-training tasks in the broader domains and also propose a novel pre-training method called SCOB that leverages character-wise supervised contrastive learning with online text rendering to effectively pre-train document and scene text domains by bridging the domain gap. Moreover, SCOB enables weakly supervised learning, significantly reducing annotation costs. Extensive benchmarks demonstrate that SCOB generally improves vanilla pre-training methods and achieves comparable performance to state-of-the-art methods. Our findings suggest that SCOB can be served generally and effectively for read-type pre-training methods. The code will be available at this https URL.
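As a rough sketch of character-wise supervised contrastive learning, the loss below applies a standard SupCon formulation to per-character embeddings; the shapes are assumptions, and SCOB's exact formulation and its online text rendering are not reproduced here.

```python
import torch
import torch.nn.functional as F

def char_supcon_loss(char_embs, char_labels, temperature=0.07):
    """Supervised contrastive loss over per-character embeddings.

    char_embs   : (N, D) embeddings of character regions pooled across the batch.
    char_labels : (N,) character class ids; equal ids form positive pairs.
    """
    z = F.normalize(char_embs, dim=-1)
    sim = z @ z.t() / temperature                           # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (char_labels.unsqueeze(0) == char_labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))         # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts.clamp(min=1)
    return loss[pos_counts > 0].mean()                      # skip anchors without positives
```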
https://arxiv.org/abs/2309.12382
Current scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM), based on graph attention, to mitigate pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves a 2-3 order-of-magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding-window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ($\mathcal{L}_{lca}$) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7% and 2.6%, respectively, increasing performance from 52.6% and 53.7% to 53.3% and 56.3%. The code is available at this https URL.
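A simplified sketch of sliding-window attention over local pixel neighborhoods, the mechanism that replaces sparse adjacency matrices in the description above; the window size is illustrative and the learned projections of the real PAM are omitted.

```python
import torch
import torch.nn.functional as F

def local_window_attention(feat, window=3):
    """Each pixel attends to its (window x window) neighborhood.

    feat : (B, C, H, W) upsampled feature map.
    """
    B, C, H, W = feat.shape
    pad = window // 2
    # Gather every pixel's neighborhood: (B, C * window * window, H * W).
    neighbors = F.unfold(feat, kernel_size=window, padding=pad)
    neighbors = neighbors.view(B, C, window * window, H * W)        # (B, C, K, HW)
    query = feat.view(B, C, 1, H * W)                               # (B, C, 1, HW)

    # Dot-product attention between each pixel and its K neighbors.
    attn = (query * neighbors).sum(dim=1, keepdim=True) / C ** 0.5  # (B, 1, K, HW)
    attn = attn.softmax(dim=2)
    out = (attn * neighbors).sum(dim=2)                             # (B, C, HW)
    return out.view(B, C, H, W)
```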
https://arxiv.org/abs/2309.08919
Crowdsourced platforms provide huge amounts of street-view images that contain valuable building information. This work addresses the challenges in applying Scene Text Recognition (STR) in crowdsourced street-view images for building attribute mapping. We use Flickr images, particularly examining texts on building facades. A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition. Manual checking on a subset of STR-recognized images demonstrates high accuracy. We examined the correlation between STR results and building functions, and analysed instances where texts were recognized on residential buildings but not on commercial ones. Further investigation revealed significant challenges associated with this task, including small text regions in street-view images, the absence of ground truth labels, and mismatches in buildings in Flickr images and building footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban hotspot locations, we suggest differentiating the scenarios where STR proves effective while developing appropriate algorithms or bringing in additional data for handling other cases. Furthermore, interdisciplinary collaboration should be undertaken to understand the motivation behind building photography and labeling. The STR-on-Flickr results are publicly available at this https URL.
https://arxiv.org/abs/2309.08042
We introduce the structured scene-text spotting task, which requires a scene-text OCR system to spot text in the wild according to a query regular expression. Contrary to generic scene text OCR, structured scene-text spotting seeks to dynamically condition both scene text detection and recognition on user-provided regular expressions. To tackle this task, we propose the Structured TExt sPotter (STEP), a model that exploits the provided text structure to guide the OCR process. STEP is able to deal with regular expressions that contain spaces and it is not bound to detection at the word-level granularity. Our approach enables accurate zero-shot structured text spotting in a wide variety of real-world reading scenarios and is solely trained on publicly available data. To demonstrate the effectiveness of our approach, we introduce a new challenging test dataset that contains several types of out-of-vocabulary structured text, reflecting important reading applications of fields such as prices, dates, serial numbers, license plates etc. We demonstrate that STEP can provide specialised OCR performance on demand in all tested scenarios.
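For contrast, the naive alternative to STEP is to run a generic spotter and filter its transcriptions with the query regular expression afterwards, as sketched below; STEP instead conditions detection and recognition on the expression itself.

```python
import re

def filter_spotted_text(ocr_results, pattern):
    """Keep only OCR detections whose transcription fully matches the query regex.

    ocr_results : list of (text, box) pairs from any generic scene-text spotter.
    pattern     : query regular expression string.
    """
    query = re.compile(pattern)
    return [(text, box) for text, box in ocr_results if query.fullmatch(text)]

# Example: keep only price-like strings such as "$12.99".
hits = filter_spotted_text(
    [("$12.99", (10, 20, 80, 40)), ("OPEN", (5, 5, 60, 25))],
    r"\$\d+(\.\d{2})?",
)
```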
https://arxiv.org/abs/2309.02356