Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on a transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.
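The twice-cyclic generation idea can be illustrated with a small training-loss sketch. The `editor` callable, its return signature, and the unit loss weights below are assumptions for illustration, not the paper's actual interface: editing the source image toward the target text and then editing the result back should reproduce the source, while the recognition branch keeps the content consistent on both passes.

```python
import torch
import torch.nn.functional as F

def cyclic_finetune_loss(editor, img_src, text_src_ids, text_tgt_ids):
    """Twice-cyclic self-supervised objective (sketch).

    `editor(img, text_ids)` is assumed to return (edited_img, text_logits): a
    stylized image rendering `text_ids` in the style of `img`, plus recognition
    logits for the text already present in `img`.
    """
    # Forward edit: source style + target content.
    img_fwd, logits_src = editor(img_src, text_tgt_ids)
    # Backward edit: bring the original content back into the edited image.
    img_cyc, logits_tgt = editor(img_fwd, text_src_ids)

    # Cycle consistency: the twice-edited image should match the source image.
    loss_cycle = F.l1_loss(img_cyc, img_src)
    # Recognition consistency on both passes (per-character cross-entropy).
    loss_rec = (
        F.cross_entropy(logits_src.flatten(0, 1), text_src_ids.flatten())
        + F.cross_entropy(logits_tgt.flatten(0, 1), text_tgt_ids.flatten())
    )
    return loss_cycle + loss_rec
```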
https://arxiv.org/abs/2503.08387
Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often encounters the dilemma among anchor features, model size, and rendering quality - large anchor features lead to large 3D models and high-quality rendering whereas reducing anchor features degrades Gaussian attribute prediction which leads to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality and reduced anchor features and model size simultaneously. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
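A minimal sketch of how covariance-based second-order statistics can augment small anchor features; the exact module design in SOGS may differ, and the feature sizes below are illustrative.

```python
import torch

def second_order_augment(anchor_feat: torch.Tensor) -> torch.Tensor:
    """Augment small anchor features with second-order (covariance-like) statistics.

    anchor_feat: (N, d) first-order anchor features with a deliberately small d.
    Returns (N, d + d*(d+1)//2) features, where the extra terms are the upper
    triangle of the per-anchor outer product, capturing correlation across
    feature dimensions. A sketch of the idea, not the paper's exact module.
    """
    n, d = anchor_feat.shape
    centered = anchor_feat - anchor_feat.mean(dim=1, keepdim=True)
    outer = centered.unsqueeze(2) * centered.unsqueeze(1)   # (N, d, d)
    iu = torch.triu_indices(d, d)                           # upper-triangle index pairs
    second_order = outer[:, iu[0], iu[1]]                   # (N, d*(d+1)//2)
    return torch.cat([anchor_feat, second_order], dim=1)

# Example: 16-dim anchors expand to 16 + 136 = 152 dims without storing larger features.
feats = torch.randn(1024, 16)
print(second_order_augment(feats).shape)  # torch.Size([1024, 152])
```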
https://arxiv.org/abs/2503.07476
Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene text typically appears in indoor spaces, serving to distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene text. Finally, the discriminative text is filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at this https URL.
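The coarse-retrieval-then-text-re-ranking step can be sketched as follows; the descriptor shapes, the string-similarity measure, and the fusion weight `alpha` are assumptions, not the paper's exact formulation.

```python
import numpy as np
from difflib import SequenceMatcher

def text_similarity(words_a, words_b):
    """Best-match average string similarity between two sets of spotted words."""
    if not words_a or not words_b:
        return 0.0
    return float(np.mean([
        max(SequenceMatcher(None, a.lower(), b.lower()).ratio() for b in words_b)
        for a in words_a
    ]))

def rerank_topk(query_desc, db_descs, db_texts, query_texts, k=10, alpha=0.5):
    """Coarse retrieval by global descriptors, then re-rank the top-K by scene text."""
    sims = db_descs @ query_desc / (
        np.linalg.norm(db_descs, axis=1) * np.linalg.norm(query_desc) + 1e-8)
    topk = np.argsort(-sims)[:k]
    scores = [alpha * sims[i] + (1 - alpha) * text_similarity(query_texts, db_texts[i])
              for i in topk]
    return topk[np.argsort(-np.array(scores))]
```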
https://arxiv.org/abs/2503.06501
During the steel billet production process, it is essential to recognize machine-printed or manually written billet numbers on moving billets in real-time. To address the issue of low recognition accuracy for existing scene text recognition methods, caused by factors such as image distortions and distribution differences between training and test data, we propose a billet number recognition method that integrates test-time adaptation with prior knowledge. First, we introduce a test-time adaptation method into a model that uses the DB network for text detection and the SVTR network for text recognition. By minimizing the model's entropy during the testing phase, the model can adapt to the distribution of test data without the need for supervised fine-tuning. Second, we leverage the billet number encoding rules as prior knowledge to assess the validity of each recognition result. Invalid results, which do not comply with the encoding rules, are replaced. Finally, we introduce a validation mechanism into the CTC algorithm using prior knowledge to address its limitations in recognizing damaged characters. Experimental results on real datasets, including both machine-printed billet numbers and handwritten billet numbers, show significant improvements in evaluation metrics, validating the effectiveness of the proposed method.
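A rough sketch of the two ingredients, test-time entropy minimization and rule-based validity checking, is shown below. The model/optimizer interfaces and the billet-number pattern are hypothetical placeholders; the real encoding rules come from the production line.

```python
import re
import torch
import torch.nn.functional as F

def tta_entropy_step(model, optimizer, images):
    """One test-time adaptation step: minimize prediction entropy on a test batch.
    Typically only normalization/affine parameters are left trainable."""
    logits = model(images)                        # assumed (B, T, C) character logits
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

# Hypothetical billet-number rule, e.g. one uppercase letter followed by seven digits.
BILLET_PATTERN = re.compile(r"^[A-Z]\d{7}$")

def is_valid_billet(number: str) -> bool:
    """Prior-knowledge check: results violating the encoding rule are flagged for replacement."""
    return BILLET_PATTERN.match(number) is not None
```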
https://arxiv.org/abs/2502.09026
Mainstream Scene Text Recognition (STR) algorithms are developed for RGB cameras, which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize scene text using bio-inspired event cameras, and we collect and annotate a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280 * 720) event samples covering both Chinese and English characters. We also benchmark multiple STR algorithms as baselines for future comparison. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts event features using a visual encoder and projects them into tokens using a Q-former module. More importantly, we augment the vision tokens with a memory mechanism before feeding them into the large language model. A similarity-based error correction mechanism is embedded within the large language model to correct potential minor errors based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulated STR datasets demonstrate the effectiveness of our proposed model. We believe that the dataset and algorithmic model establish a new event-based STR task and will accelerate the application of event cameras in various industries. The source code and pre-trained models will be released on this https URL
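The memory-based augmentation of vision tokens might look roughly like the following; the tensor shapes, `top_k`, and the residual fusion are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def memory_augment(vision_tokens: torch.Tensor, memory: torch.Tensor, top_k: int = 4):
    """Augment Q-former vision tokens with nearest entries from a feature memory.

    vision_tokens: (B, N, D) tokens produced from event features.
    memory:        (M, D) stored prototype features.
    Returns tokens fused with an attention-weighted mix of their top-k memory matches.
    """
    q = F.normalize(vision_tokens, dim=-1)
    m = F.normalize(memory, dim=-1)
    sim = q @ m.t()                                   # (B, N, M) cosine similarities
    topv, topi = sim.topk(top_k, dim=-1)              # nearest memory slots per token
    weights = F.softmax(topv, dim=-1)                 # (B, N, k)
    retrieved = memory[topi]                          # (B, N, k, D)
    fused = (weights.unsqueeze(-1) * retrieved).sum(-2)
    return vision_tokens + fused                      # residual augmentation
```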
https://arxiv.org/abs/2502.09020
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
https://arxiv.org/abs/2502.07411
Designing datasets for Visual Question Answering (VQA) is a difficult and complex task that requires NLP for parsing and computer vision for analysing the relevant aspects of the image for answering the question asked. Several benchmark datasets have been developed by researchers but there are many issues with using them for methodical performance tests. This paper proposes a new benchmark dataset -- a pilot version called VQA-Levels is ready now -- for testing VQA systems systematically and assisting researchers in advancing the field. The questions are classified into seven levels ranging from direct answers based on low-level image features (without needing even a classifier) to those requiring high-level abstraction of the entire image content. The questions in the dataset exhibit one or many of ten properties. Each is categorised into a specific level from 1 to 7. Levels 1 - 3 are directly on the visual content while the remaining levels require extra knowledge about the objects in the image. Each question generally has a unique one or two-word answer. The questions are 'natural' in the sense that a human is likely to ask such a question when seeing the images. An example question at Level 1 is, ``What is the shape of the red colored region in the image?" while at Level 7, it is, ``Why is the man cutting the paper?". Initial testing of the proposed dataset on some of the existing VQA systems reveals that their success is high on Level 1 (low-level features) and Level 2 (object classification) questions, lowest on Level 3 (scene text), followed by Level 6 (extrapolation) and Level 7 (whole scene analysis) questions. The work in this paper will go a long way toward the systematic analysis of VQA systems.
https://arxiv.org/abs/2502.02951
Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present \textbf{Ocean-OCR}, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ Native Resolution ViT to enable variable resolution input and utilize a substantial collection of high-quality OCR datasets to enhance the model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.
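A native-resolution patch embedding, in the spirit described above, can be sketched as follows; the patch size and embedding dimension are illustrative, and the real model likely adds positional handling omitted here.

```python
import torch
import torch.nn as nn

class NativeResolutionPatchEmbed(nn.Module):
    """Patch embedding that accepts arbitrary input resolutions (sketch).

    Instead of resizing every image to a fixed square, the image is padded to a
    multiple of the patch size and tokenized on its native grid, so thin text
    lines and large documents keep their aspect ratio.
    """
    def __init__(self, patch=14, dim=768):
        super().__init__()
        self.patch = patch
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W), any H and W
        _, _, h, w = x.shape
        pad_h = (-h) % self.patch
        pad_w = (-w) % self.patch
        x = nn.functional.pad(x, (0, pad_w, 0, pad_h))
        tokens = self.proj(x)                      # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim), N varies with resolution

tokens = NativeResolutionPatchEmbed()(torch.randn(1, 3, 336, 1000))
print(tokens.shape)  # grid of 24 x 72 patches -> torch.Size([1, 1728, 768])
```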
https://arxiv.org/abs/2501.15558
The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance. Word-level bounding boxes as prompts are too coarse for characters, while character-level bounding boxes as prompts suffer from over-segmentation and under-segmentation issues. In this paper, we propose an automatic annotation pipeline named Char-SAM that turns SAM into a low-cost segmentation annotator with character-level visual prompts. Specifically, leveraging existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module. Next, we employ glyph information corresponding to text character categories as a new prompt in the Character Glyph Refinement (CGR) module to guide SAM in producing more accurate segmentation masks, addressing issues of over-segmentation and under-segmentation. These modules fully utilize the bbox-to-mask capability of SAM to generate high-quality text segmentation annotations automatically. Extensive experiments on TextSeg validate the effectiveness of Char-SAM. Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets like COCO-Text and MLT17.
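As a rough stand-in for character-level box refinement, the sketch below splits a word-level box using a vertical ink-projection profile; it is not the paper's CBR module, and all thresholds are arbitrary.

```python
import numpy as np

def split_word_box(gray_crop: np.ndarray, word_box, n_chars: int):
    """Split a word-level box into character-level boxes (rough CBR-style stand-in).

    gray_crop: grayscale crop of the word region; word_box: (x, y, w, h) in image
    coordinates; n_chars: number of characters in the transcription. Split columns
    are chosen at low-ink valleys of the vertical projection profile, falling back
    to equal-width splitting when no clean valleys are found.
    """
    x, y, w, h = word_box
    cols = gray_crop.shape[1]
    ink = (gray_crop < gray_crop.mean()).sum(axis=0).astype(float)   # vertical ink profile
    targets = [i * cols / n_chars for i in range(1, n_chars)]
    valleys = [c for c in range(1, cols - 1)
               if ink[c] <= ink[c - 1] and ink[c] <= ink[c + 1] and ink[c] < 0.5 * ink.mean()]
    splits = [min(valleys, key=lambda c: abs(c - t)) for t in targets] if valleys else []
    if len(set(splits)) != n_chars - 1:                               # no usable valleys
        splits = [round(t) for t in targets]
    edges = [0] + sorted(set(splits)) + [cols]
    scale = w / cols
    return [(x + round(a * scale), y, round((b - a) * scale), h)
            for a, b in zip(edges, edges[1:])]
```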
https://arxiv.org/abs/2412.19917
Recent advancements in scene text spotting have focused on end-to-end methodologies that heavily rely on precise location annotations, which are often costly and labor-intensive to procure. In this study, we introduce an innovative approach that leverages only transcription annotations for training text spotting models, substantially reducing the dependency on elaborate annotation processes. Our methodology employs a query-based paradigm that facilitates the learning of implicit location features through the interaction between text queries and image embeddings. These features are later refined during the text recognition phase using an attention activation map. Addressing the challenges associated with training a weakly-supervised model from scratch, we implement a circular curriculum learning strategy to enhance model convergence. Additionally, we introduce a coarse-to-fine cross-attention localization mechanism for more accurate text instance localization. Notably, our framework supports audio-based annotation, which significantly diminishes annotation time and provides an inclusive alternative for individuals with disabilities. Our approach achieves competitive performance against existing benchmarks, demonstrating that high accuracy in text spotting can be attained without extensive location annotations.
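Refining an implicit location from an attention activation map can be sketched as a simple thresholding step; the `keep` ratio and the box extraction below are assumptions, not the paper's coarse-to-fine mechanism.

```python
import torch

def attention_to_box(attn_map: torch.Tensor, keep: float = 0.4):
    """Derive a coarse text-instance box from a query's cross-attention map (sketch).

    attn_map: (H, W) attention weights of one text query over image features.
    Pixels above `keep` * max are treated as the instance's support region and the
    tight bounding box around them is returned as (x0, y0, x1, y1).
    """
    mask = attn_map >= keep * attn_map.max()
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()
```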
https://arxiv.org/abs/2412.19504
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM database entries based on scene text. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization information. Furthermore, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at this https URL.
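A text-to-tile retrieval step that returns similarity scores together with a human-readable retrieval reason might be sketched as below; the encoders, embedding shapes, and the "reason" string are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_by_text(text_emb, db_embs, db_texts, k=5):
    """Retrieve satellite/OSM candidates for a scene-text description (sketch).

    text_emb: (D,) embedding of the natural-language query; db_embs: (N, D)
    embeddings of database tiles; db_texts: the scene-text strings attached to each
    tile, returned alongside scores as a simple retrieval reason.
    """
    sims = F.cosine_similarity(text_emb.unsqueeze(0), db_embs, dim=-1)   # (N,)
    scores, idx = sims.topk(k)
    return [
        {"tile": int(i), "score": float(s),
         "reason": f"matched scene text: {db_texts[int(i)]}"}
        for s, i in zip(scores, idx)
    ]
```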
https://arxiv.org/abs/2412.17007
Indoor scene texture synthesis has garnered significant interest due to its potential applications in virtual reality, digital media, and creative arts. Existing diffusion model-based research either relies on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or resorts to optimization-based approaches that entail substantial computational overhead. In this work, we present RoomPainter, a framework that seamlessly integrates efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (MVIS) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using MVIS, we first generate a texture map for the entire room to ensure global consistency, and then adopt its variant, attention-guided multi-view integrated repaint sampling (MVRS), to repaint individual instances within the room, thereby further enhancing local consistency. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency, and generation efficiency.
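Multi-view integration can be approximated by weighted blending of per-view predictions in texture space, as in the simplified sketch below; the attention-guided sampling inside the diffusion process is not reproduced here.

```python
import numpy as np

def blend_views_to_texture(view_colors, view_weights, texel_uv, tex_size=512):
    """Fuse per-view predictions into one texture map (simplified MVIS-style blend).

    view_colors:  list of (P, 3) colors predicted for P surface points in each view.
    view_weights: list of (P,) per-point weights (e.g. visibility or attention).
    texel_uv:     (P, 2) UV coordinates in [0, 1) shared by all views.
    Points seen by several views are averaged with their weights, which is what keeps
    the texture globally consistent across views in this sketch.
    """
    tex = np.zeros((tex_size, tex_size, 3))
    acc = np.zeros((tex_size, tex_size, 1))
    uv = (texel_uv * tex_size).astype(int).clip(0, tex_size - 1)
    for colors, weights in zip(view_colors, view_weights):
        np.add.at(tex, (uv[:, 1], uv[:, 0]), colors * weights[:, None])
        np.add.at(acc, (uv[:, 1], uv[:, 0]), weights[:, None])
    return tex / np.maximum(acc, 1e-8)
```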
https://arxiv.org/abs/2412.16778
In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.
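Attribute-derived instructions could be constructed along the following lines; the specific attributes and phrasings are invented for illustration and are not the instruction set defined by InstructOCR.

```python
def build_instructions(transcription: str) -> list[str]:
    """Build attribute-based instructions for one text instance (illustrative only)."""
    attrs = {
        "length": len(transcription),
        "is_digit": transcription.isdigit(),
        "is_upper": transcription.isupper(),
    }
    return [
        f"Find the text with {attrs['length']} characters.",
        "Find the text that is a number." if attrs["is_digit"]
        else "Find the text that contains letters.",
        "Find the text written in uppercase." if attrs["is_upper"]
        else "Find the text that is not fully uppercase.",
    ]

print(build_instructions("EXIT"))
```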
https://arxiv.org/abs/2412.15523
Connected component (CC) is a proper text shape representation that aligns with human reading intuition. However, CC-based text detection methods have recently faced a developmental bottleneck that their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing. Concretely, we first represent each text instance as multiple ordered text components, and then treat these components as objects in sequential movement. In this way, scene text detection can be innovatively viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder to achieve a CC-based method dispensing with post-processing entirely. Additionally, we observe that there is an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to quickly and accurately evaluate the localization quality. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of our ERRNet. It consistently achieves state-of-the-art accuracy while holding highly competitive inference speed.
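The Polygon Monte-Carlo idea, estimating localization quality by random point sampling, is easy to sketch; the sample count, the use of matplotlib's point-in-polygon test, and the way the quality feeds the classification target are illustrative choices.

```python
import numpy as np
from matplotlib.path import Path

def polygon_iou_mc(poly_a, poly_b, n_samples=5000, rng=None):
    """Monte-Carlo IoU between two polygons (each an (N, 2) array of vertices).

    Random points are drawn in the union's bounding box; IoU is the ratio of points
    falling inside both polygons to points falling inside at least one.
    """
    rng = rng or np.random.default_rng(0)
    pts_all = np.concatenate([poly_a, poly_b], axis=0)
    lo, hi = pts_all.min(axis=0), pts_all.max(axis=0)
    samples = rng.uniform(lo, hi, size=(n_samples, 2))
    in_a = Path(poly_a).contains_points(samples)
    in_b = Path(poly_b).contains_points(samples)
    union = (in_a | in_b).sum()
    return float((in_a & in_b).sum() / max(union, 1))

def position_supervised_target(iou: float, gamma: float = 1.0) -> float:
    """Soften the positive classification label toward the measured localization quality."""
    return iou ** gamma
```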
https://arxiv.org/abs/2412.14692
Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning textual and visual information in a given video. Inspired by the development of TextVQA in image domain, existing Video TextVQA approaches leverage a language model (e.g. T5) to process text-rich multiple frames and generate answers auto-regressively. Nevertheless, the spatio-temporal relationships among visual entities (including scene text and objects) will be disrupted and models are susceptible to interference from unrelated information, resulting in irrational reasoning and inaccurate answering. To tackle these challenges, we propose the TEA (stands for ``\textbf{T}rack th\textbf{E} \textbf{A}nswer'') method that better extends the generative TextVQA framework from image to video. TEA recovers the spatio-temporal relationships in a complementary way and incorporates OCR-aware clues to enhance the quality of reasoning questions. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. TEA outperforms existing TextVQA methods, video-language pretraining methods and video large language models by great margins.
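Recovering spatio-temporal relationships among OCR tokens can be approximated with a simple greedy tracker, as sketched below; the thresholds and matching rule are assumptions, not TEA's actual mechanism.

```python
from difflib import SequenceMatcher

def box_iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def link_ocr_tracks(frames, iou_thr=0.3, text_thr=0.6):
    """Greedily link per-frame OCR tokens into tracks so the same scene-text instance
    keeps one identity across frames; a simplified stand-in for recovering
    spatio-temporal relationships before feeding clues to the answer generator.

    frames: list (over time) of lists of dicts like {"box": (x0, y0, x1, y1), "text": "EXIT"}.
    """
    tracks = []
    for detections in frames:
        for det in detections:
            match = next(
                (t for t in tracks
                 if box_iou(det["box"], t[-1]["box"]) > iou_thr
                 and SequenceMatcher(None, det["text"], t[-1]["text"]).ratio() > text_thr),
                None,
            )
            if match is not None:
                match.append(det)
            else:
                tracks.append([det])
    return tracks
```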
https://arxiv.org/abs/2412.12502
Scene text spotting has attracted considerable research interest in recent years. Most existing scene text spotters follow the detection-then-recognition paradigm, where the vanilla detection module can hardly determine the reading order, leading to recognition failures. After rethinking auto-regressive scene text recognition, we find that a well-trained recognizer can implicitly perceive the local semantics of all characters in a complete word or sentence without a character-level detection module. Local semantic knowledge includes not only the text content but also the spatial information in the right reading order. Motivated by this analysis, we propose the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the position and content of characters guided by local semantics. Specifically, two effective modules are proposed in LSGSpotter. On the one hand, we design a Start Point Localization Module (SPLM) to locate text start points and determine the right reading order. On the other hand, a Multi-scale Adaptive Attention Module (MAAM) is proposed to adaptively aggregate text features in a local area. In conclusion, LSGSpotter handles the arbitrary-reading-order spotting task without the limitations of sophisticated detection, while reducing computational cost with a grid sampling strategy. Extensive experimental results show LSGSpotter achieves state-of-the-art performance on the InverseText benchmark. Moreover, our spotter demonstrates superior performance on English benchmarks for arbitrary-shaped text, achieving improvements of 0.7\% and 2.5\% on Total-Text and SCUT-CTW1500, respectively. These results validate that our text spotter is effective for scene text in arbitrary reading orders and shapes.
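The grid sampling strategy mentioned above can be sketched with `grid_sample`: features are read out only at predicted character centers; the coordinate convention and tensor shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def sample_char_features(feat_map: torch.Tensor, char_centers: torch.Tensor):
    """Sample local features at predicted character positions via grid sampling (sketch).

    feat_map:     (B, C, H, W) backbone feature map.
    char_centers: (B, T, 2) character centers in normalized [-1, 1] (x, y) coordinates,
                  e.g. decoded auto-regressively starting from the predicted start point.
    Returns (B, T, C) per-character features, avoiding any dense RoI cropping.
    """
    grid = char_centers.unsqueeze(2)                                  # (B, T, 1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=False)      # (B, C, T, 1)
    return sampled.squeeze(-1).transpose(1, 2)                        # (B, T, C)

feats = sample_char_features(torch.randn(2, 256, 32, 32), torch.rand(2, 25, 2) * 2 - 1)
print(feats.shape)  # torch.Size([2, 25, 256])
```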
https://arxiv.org/abs/2412.10159
Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent their literacy capabilities hold up under rich structure and fine-grained visual challenges. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs. Existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that covers a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. Furthermore, we evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation text, and repetition hallucination. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, driving advancement in LMMs.
https://arxiv.org/abs/2412.02210
Scene text recognition (STR) suffers from either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information, so that the synthesized text region is far less complex than an entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR scales well, and we construct an anagram-based TextSSR-F dataset of 0.4 million complex and realistic text instances. Experiments show that models trained with added TextSSR-F data achieve better accuracy than models trained on 4 million existing synthetic samples. Moreover, the accuracy gap to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential for scene text image synthesis. Our code is available at this https URL.
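The region-level conditioning, a context crop that carries neighboring text plus a glyph/position map, might be prepared roughly as below; the padding, font handling, and output format are assumptions for illustration.

```python
from PIL import Image, ImageDraw, ImageFont

def make_region_condition(image: Image.Image, box, text: str, pad: int = 32):
    """Build inputs for region-level text synthesis (sketch of the conditioning).

    Crops the target region with some surrounding context (so neighboring text can act
    as an implicit style prompt) and renders a plain glyph mask of the target string at
    the region's position. Font handling uses PIL's default font for simplicity.
    """
    x0, y0, x1, y1 = box
    context = image.crop((max(0, x0 - pad), max(0, y0 - pad), x1 + pad, y1 + pad))

    glyph = Image.new("L", context.size, 0)
    draw = ImageDraw.Draw(glyph)
    draw.text((min(x0, pad), min(y0, pad)), text, fill=255, font=ImageFont.load_default())
    return context, glyph   # image context + glyph/position map for the diffusion model
```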
https://arxiv.org/abs/2412.01137
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at this https URL.
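The multi-size resizing strategy can be sketched as aspect-ratio bucketing; the candidate shapes below are illustrative, not the configuration used by SVTRv2.

```python
import torch
import torch.nn.functional as F

# Candidate (height, width) shapes for different aspect-ratio ranges (illustrative values).
MSR_SHAPES = [(64, 64), (48, 96), (40, 128), (32, 192), (32, 256)]

def multi_size_resize(img: torch.Tensor) -> torch.Tensor:
    """Resize a text image to the candidate shape whose aspect ratio is closest to its
    own, instead of forcing every word onto one fixed canvas (sketch of the MSR idea)."""
    _, h, w = img.shape                                    # (C, H, W)
    ratio = w / h
    target = min(MSR_SHAPES, key=lambda s: abs(s[1] / s[0] - ratio))
    return F.interpolate(img.unsqueeze(0), size=target, mode="bilinear",
                         align_corners=False).squeeze(0)

out = multi_size_resize(torch.rand(3, 20, 300))            # very long text line
print(out.shape)  # torch.Size([3, 32, 256])
```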
https://arxiv.org/abs/2411.15858
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotonousness of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and generalize the ability to recognize complex samples when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes the sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experiment results show that our method achieves SOTA performance (94.7\% and 70.9\% average accuracy on common benchmarks and Union14M-Benchmark). Code will be available at this https URL.
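A sketch of a one-directional character alignment loss, pulling student character features toward detached teacher class references, is given below; the exact formulation in the paper may differ (e.g. in normalization and weighting).

```python
import torch
import torch.nn.functional as F

def char_unidirectional_alignment_loss(student_feats, teacher_feats, labels):
    """Align student character features to (stop-gradient) teacher references (sketch).

    student_feats, teacher_feats: (N, D) features of N characters from the two branches;
    labels: (N,) character class ids. Each student feature is pulled toward the mean
    teacher feature of its own class only, i.e. the alignment is one-directional.
    """
    teacher_feats = teacher_feats.detach()
    loss, classes = 0.0, labels.unique()
    for c in classes:
        mask = labels == c
        ref = F.normalize(teacher_feats[mask].mean(dim=0), dim=0)     # class reference
        stu = F.normalize(student_feats[mask], dim=1)
        loss = loss + (1.0 - stu @ ref).mean()                        # cosine alignment
    return loss / len(classes)
```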
https://arxiv.org/abs/2411.15585