Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. Through extensive experiments, we uncover two key factors contributing to this discrepancy. First, a \textit{Fine-tuning Gap}, where models leverage the \textit{Dataset-Specific Optimization} (DSO) paradigm to excel in one domain at the cost of reduced effectiveness in others, leads to inflated performance on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a \textit{Joint-Dataset Learning} (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of a detector's ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at this https URL.
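A rough sense of how the JDL protocol differs from dataset-specific fine-tuning can be given in a few lines. The sketch below is an assumption-laden illustration, not the paper's training code; `datasets`, `sample_batch`, and `detector.loss` are hypothetical stand-ins.

```python
# Minimal sketch of a Joint-Dataset Learning (JDL) step, as opposed to
# fine-tuning on one benchmark at a time. All interfaces are placeholders.
import random

def jdl_step(detector, optimizer, datasets):
    """Sample a batch from a randomly chosen dataset so every update mixes
    domains instead of specializing the model to a single benchmark."""
    dataset = random.choice(datasets)          # uniform mixing across domains
    images, targets = dataset.sample_batch()   # hypothetical batch sampler
    loss = detector.loss(images, targets)      # generic detection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```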
https://arxiv.org/abs/2505.15649
Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex scripts like Chinese, in which characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model toward generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures: from individual strokes, to stroke-level interactions, to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02\% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28\%.
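A minimal sketch of the cross-level interaction idea, assuming character-level and line-level token shapes: local glyph tokens attend to global text-line tokens via standard cross-attention. This is an illustration, not GlyphMastero's actual glyph attention module.

```python
# Character tokens (local) attend to text-line tokens (global); shapes assumed.
import torch
import torch.nn as nn

class GlyphCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, char_tokens, line_tokens):
        # char_tokens: (B, N_chars, dim); line_tokens: (B, N_lines, dim)
        fused, _ = self.attn(query=char_tokens, key=line_tokens, value=line_tokens)
        return self.norm(char_tokens + fused)      # residual fusion

out = GlyphCrossAttention()(torch.randn(2, 12, 256), torch.randn(2, 3, 256))
```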
https://arxiv.org/abs/2505.04915
The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (\eg, Chinese), which have complex glyph structures. To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing framework based on FLUX-Fill. Specifically, we carefully investigate glyph conditioning, considering both visual and textual modalities. To retain the original generative capabilities of FLUX-Fill while enhancing its understanding and generation of glyphs, we propose lightweight glyph and text embedding modules. Owing to this lightweight design, FLUX-Text is trained with only $100K$ training examples, compared with the 2.9M used by current popular methods. With no bells and whistles, our method achieves state-of-the-art performance on text editing tasks. Qualitative and quantitative experiments on public datasets demonstrate that our method surpasses previous works in text fidelity.
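A hedged sketch of the lightweight glyph-conditioning idea: a rendered glyph image is encoded by a small network and added to the text embedding that conditions the inpainting backbone. The module and shapes are illustrative, not FLUX-Fill's or FLUX-Text's actual interfaces.

```python
# Small glyph encoder whose output is broadcast-added to the text embedding.
import torch
import torch.nn as nn

class LightweightGlyphEmbed(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, glyph_image, text_embedding):
        # glyph_image: (B, 1, H, W) rendered target text; text_embedding: (B, L, dim)
        glyph = self.encoder(glyph_image).unsqueeze(1)   # (B, 1, dim)
        return text_embedding + glyph                    # broadcast over tokens

emb = LightweightGlyphEmbed()(torch.randn(2, 1, 64, 256), torch.randn(2, 77, 768))
```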
https://arxiv.org/abs/2505.03329
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
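The frequency-domain argument can be illustrated with a few lines of NumPy: compute the spectrum of a corruption residual and measure how much of its energy falls below a low-frequency cutoff. Gaussian noise stands in here for the ImageNet-C corruptions, and the cutoff is an arbitrary choice.

```python
# Low- vs. high-frequency energy of a corruption residual (illustrative only).
import numpy as np

def low_freq_energy_ratio(clean, corrupted, cutoff=0.1):
    residual = corrupted - clean
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(residual))) ** 2
    h, w = spectrum.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)   # normalized frequency radius
    return spectrum[radius < cutoff].sum() / spectrum.sum()

clean = np.random.rand(224, 224)
noisy = clean + 0.1 * np.random.randn(224, 224)       # noise-type corruption
print(low_freq_energy_ratio(clean, noisy))            # noise -> mostly high-frequency
```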
https://arxiv.org/abs/2504.13690
Recently, supervised learning has developed rapidly in scene text segmentation. However, the lack of high-quality datasets and the high cost of pixel-level annotation greatly limit its development. Motivated by the strong performance of few-shot learning methods on downstream tasks, we investigate the application of few-shot learning to scene text segmentation. We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable adaptive prompts to capture distinctive text features and complex background distributions, we propose the Adaptive Feature Alignment (AFA) module. By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only a few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and shows great potential in text-related domains.
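The alignment behind AFA can be sketched as pulling learnable attribute tokens toward both visual features and prompt prototypes with a cosine objective; the shapes and the exact loss form are assumptions, not the paper's module.

```python
# Cosine-based alignment of learnable attribute tokens (illustrative).
import torch
import torch.nn.functional as F

def afa_alignment_loss(attr_tokens, visual_feats, prompt_protos):
    # attr_tokens: (K, D) learnable tokens; visual_feats: (N, D); prompt_protos: (K, D)
    v_sim = F.cosine_similarity(attr_tokens.unsqueeze(1), visual_feats.unsqueeze(0), dim=-1)
    p_sim = F.cosine_similarity(attr_tokens, prompt_protos, dim=-1)
    # pull each token toward its best-matching visual feature and its prototype
    return (1 - v_sim.max(dim=1).values).mean() + (1 - p_sim).mean()

tokens = torch.randn(4, 512, requires_grad=True)
loss = afa_alignment_loss(tokens, torch.randn(50, 512), torch.randn(4, 512))
loss.backward()
```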
https://arxiv.org/abs/2504.11164
Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce these expensive annotation costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods designed for general scenes to SSTS faces new challenges: 1) inconsistent pseudo labels between the detection and recognition tasks, and 2) sub-optimal supervision caused by teacher-student inconsistency. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS, which leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important location and transcription information from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under the 0.5%, 1%, and 2% labeled-data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows its practical potential. Moreover, our method demonstrates consistent improvements on different text spotters.
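A hedged sketch of the hierarchical pseudo-label idea: keep a teacher detection for the detection task when its box score is high, and promote it to an end-to-end (box + transcription) label only when the recognition score also agrees. The record format and thresholds below are assumptions, not SemiETS's actual values.

```python
# Split teacher outputs into detection-only and end-to-end pseudo labels.
def filter_pseudo_labels(teacher_outputs, det_thr=0.7, rec_thr=0.9):
    det_labels, e2e_labels = [], []
    for box, det_score, text, rec_score in teacher_outputs:
        if det_score < det_thr:
            continue                        # too noisy even for detection
        det_labels.append(box)              # reliable location-only supervision
        if rec_score >= rec_thr:
            e2e_labels.append((box, text))  # reliable joint supervision
    return det_labels, e2e_labels

dets, e2e = filter_pseudo_labels([((0, 0, 50, 20), 0.95, "EXIT", 0.97),
                                  ((60, 5, 90, 25), 0.80, "ca1e", 0.42)])
```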
https://arxiv.org/abs/2504.09966
Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps, simplifying the whole detection pipeline. Current approaches either represent irregular shapes via a box-to-polygon strategy or decompose a contour into pieces that are fitted gradually; coarse contours or complex pipelines persist in these models. Considering the above issues, we introduce EdgeText to fit text contours compactly while alleviating excessive contour-rebuilding processes. Concretely, we observe that the two long edges of a text instance can be regarded as smooth curves. This allows us to build contours from continuous, smooth edges that cover text regions tightly instead of fitting them piecewise, which helps avoid the two limitations of current models. Inspired by this observation, EdgeText formulates text representation as an edge approximation problem solved via parameterized curve fitting functions. In the inference stage, our model starts by locating text centers and then creates curve functions that approximate the text edges based on these points. Meanwhile, truncation points are determined based on the location features. Finally, curve segments are extracted from the curve functions using the pixel coordinates given by the truncation points to reconstruct text contours. Furthermore, considering EdgeText's deep dependency on text edges, a bilateral enhanced perception (BEP) module is designed to encourage the model to attend to edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) that forces the model to focus on the curve distribution and avoid being disturbed by text scales.
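The edge-as-curve idea can be illustrated with plain polynomial fitting: sample points along one long edge, fit a low-order curve, and evaluate it densely to obtain a smooth contour segment. The paper's actual curve family, truncation-point logic, and PI-loss are not reproduced here.

```python
# Represent one long text edge as a parameterized smooth curve (illustration).
import numpy as np

edge_points = np.array([[10, 42], [30, 40], [55, 38], [80, 39], [105, 43]], float)
coeffs = np.polyfit(edge_points[:, 0], edge_points[:, 1], deg=3)   # curve parameters
xs = np.linspace(edge_points[0, 0], edge_points[-1, 0], 50)
ys = np.polyval(coeffs, xs)                                        # smooth top edge
contour_segment = np.stack([xs, ys], axis=1)                       # (50, 2) points
```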
https://arxiv.org/abs/2504.04001
Scene text image super-resolution (STISR) enhances the resolution and quality of low-resolution scene text images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) explicit categorical priors, like TP, can negatively impact STISR when they are incorrect. We reveal that these explicit priors are unstable and propose replacing them with a Non-CAtegorical Prior (NCAP) built from penultimate-layer representations. (2) The pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose a method to mitigate it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement of 3.5\%, while our method significantly enhances generalization performance by 14.8\% across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.
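The hard/soft label mixing used to counter recognizer overconfidence can be sketched as a weighted sum of a cross-entropy term on hard labels and a KL term on the teacher's soft distribution; the weighting and tensor shapes below are assumptions.

```python
# Mix hard (index) and soft (distribution) supervision for the recognizer.
import torch
import torch.nn.functional as F

def mixed_label_loss(logits, hard_targets, teacher_logits, alpha=0.5):
    # logits, teacher_logits: (B, T, C); hard_targets: (B, T) class indices
    ce = F.cross_entropy(logits.flatten(0, 1), hard_targets.flatten())
    soft = F.kl_div(F.log_softmax(logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    return alpha * ce + (1 - alpha) * soft

loss = mixed_label_loss(torch.randn(2, 8, 37), torch.randint(0, 37, (2, 8)),
                        torch.randn(2, 8, 37))
```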
https://arxiv.org/abs/2504.00410
Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName~significantly outperforms SD3.5 Large~\cite{sd3} and GPT4o~\cite{gpt4o} with DALL-E 3~\cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName~opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
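The text-focused binary tokenizer can be illustrated at its core by binarizing latent features with a straight-through estimator, so gradients flow through the hard 0/1 codes; the encoder/decoder around it and the bit width are placeholders.

```python
# Binary quantization with a straight-through estimator (core idea only).
import torch

def binary_quantize(latent):
    """latent: (B, N, D) continuous features -> {0,1} codes with STE gradients."""
    bits = (latent > 0).float()                   # hard 0/1 codes
    return latent + (bits - latent).detach()      # straight-through gradient

codes = binary_quantize(torch.randn(2, 256, 32))  # 32 bits per visual token
```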
https://arxiv.org/abs/2503.20198
In recent years, vision transformers with a text decoder have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity. However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications. To address this challenge, we propose an efficient and accurate STR system. Specifically, we focus on improving the efficiency of the encoder by introducing a cascaded-transformers structure. This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost. Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. In particular, for large models, the accuracy remains nearly the same (92.77 vs. 92.68) while the computational complexity is almost halved with our structure.
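The cascaded-transformers idea, progressively pruning vision tokens between encoder stages, can be sketched as follows; the token-importance score (feature norm) and keep ratios are illustrative choices, not the paper's.

```python
# Keep only the highest-scoring tokens after each encoder stage.
import torch
import torch.nn as nn

class CascadedEncoder(nn.Module):
    def __init__(self, dim=192, keep_ratios=(1.0, 0.7, 0.5)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in keep_ratios)
        self.keep_ratios = keep_ratios

    def forward(self, tokens):                        # tokens: (B, N, dim)
        for stage, ratio in zip(self.stages, self.keep_ratios):
            tokens = stage(tokens)
            k = max(1, int(tokens.shape[1] * ratio))
            scores = tokens.norm(dim=-1)              # crude token importance
            idx = scores.topk(k, dim=1).indices.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
            tokens = tokens.gather(1, idx)            # drop redundant tokens
        return tokens

out = CascadedEncoder()(torch.randn(2, 128, 192))     # later stages see fewer tokens
```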
https://arxiv.org/abs/2503.18883
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
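A minimal sketch of the linguistics-alignment intuition: the same text rendered with two different visual appearances should map to a shared, vision-independent feature that can then guide MIM decoding. The tiny encoder and cosine objective below are placeholders, not LMIM's module.

```python
# Align features of two differently-styled views of the same text.
import torch
import torch.nn.functional as F

def linguistic_guidance(encoder, view_a, view_b):
    feat_a, feat_b = encoder(view_a), encoder(view_b)     # (B, D) each
    align_loss = 1 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()
    guidance = 0.5 * (feat_a + feat_b)                    # appearance-averaged feature
    return guidance, align_loss

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 128, 256))
g, l = linguistic_guidance(encoder, torch.randn(2, 32, 128), torch.randn(2, 32, 128))
```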
https://arxiv.org/abs/2503.18746
Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.
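The differential cross-attention mentioned for the decoder can be sketched as the difference of two softmax attention maps with a learnable weight; the head layout and lambda parameterization below are simplified assumptions rather than the paper's exact design.

```python
# Difference of two attention maps between text queries and vision tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffCrossAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, 2 * dim)   # two query/key groups
        self.k = nn.Linear(dim, 2 * dim)
        self.v = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))

    def forward(self, text_tokens, vision_tokens):
        d = vision_tokens.shape[-1]
        q1, q2 = self.q(text_tokens).chunk(2, dim=-1)
        k1, k2 = self.k(vision_tokens).chunk(2, dim=-1)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) / d ** 0.5, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) / d ** 0.5, dim=-1)
        attn = a1 - self.lam * a2          # differential map cancels common noise
        return attn @ self.v(vision_tokens)

out = DiffCrossAttention()(torch.randn(2, 25, 256), torch.randn(2, 197, 256))
```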
https://arxiv.org/abs/2503.16184
Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection, the framework follows a block-level comparison between the feature map and the source image and harnesses contextual information using pretrained captioners, allowing it to generate word predictions directly from scene images. These texts are semantically and lexically evaluated to obtain a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier end-to-end STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.
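The confidence-gated fast path can be captured in a few lines: a lightweight, context-based predictor runs first, and the heavy end-to-end STR model is invoked only when the combined semantic/lexical score falls below a threshold. `light_predict` and `heavy_str` are hypothetical callables.

```python
# Route an image through the cheap path unless its confidence is too low.
def spot_text(image, light_predict, heavy_str, threshold=0.85):
    words, score = light_predict(image)     # captioner/context-based guess + score
    if score >= threshold:
        return words                        # fast path: skip the heavy model
    return heavy_str(image)                 # fall back to full STR profiling
```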
https://arxiv.org/abs/2503.15639
Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, \mymodel achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.
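The twice-cyclic self-supervision on unpaired data can be sketched as an edit-and-restore round trip whose output should match the input; `editor` is a hypothetical callable and the L1 term stands in for the paper's actual reconstruction objective.

```python
# Cycle consistency for unpaired scene text editing (illustrative).
import torch.nn.functional as F

def cyclic_consistency_loss(editor, image, original_text, target_text):
    edited = editor(image, target_text)          # first pass: change the text
    restored = editor(edited, original_text)     # second pass: change it back
    return F.l1_loss(restored, image)            # round trip should match input
```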
https://arxiv.org/abs/2503.08387
Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often encounters the dilemma among anchor features, model size, and rendering quality - large anchor features lead to large 3D models and high-quality rendering whereas reducing anchor features degrades Gaussian attribute prediction which leads to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality and reduced anchor features and model size simultaneously. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
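The covariance-based augmentation of reduced anchor features can be sketched as computing second-order statistics across feature dimensions and concatenating them to the first-order feature; the split into sub-features and the lack of normalization are simplifications, not SOGS's exact formulation.

```python
# Augment each anchor's feature with covariance-style second-order statistics.
import torch

def second_order_augment(anchor_feats):
    # anchor_feats: (num_anchors, k, d) -- k sub-features of dimension d per anchor
    mean = anchor_feats.mean(dim=1, keepdim=True)
    centered = anchor_feats - mean
    cov = centered.transpose(1, 2) @ centered / anchor_feats.shape[1]  # (N, d, d)
    second_order = cov.flatten(1)                                      # (N, d*d)
    return torch.cat([anchor_feats.flatten(1), second_order], dim=1)

aug = second_order_augment(torch.randn(1000, 4, 8))     # (1000, 32 + 64)
```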
https://arxiv.org/abs/2503.07476
Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene text typically appears in indoor spaces, serving to distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene text. Finally, the discriminative text is filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at this https URL.
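The text-based re-ranking stage can be sketched by re-ordering the top-K retrieval candidates according to how well their spotted text matches the query's; difflib's ratio is a stand-in for the paper's text-similarity score, and the candidate record layout is assumed.

```python
# Re-rank retrieved images by spotted-text similarity, then retrieval score.
from difflib import SequenceMatcher

def rerank_by_text(query_texts, candidates):
    """candidates: list of (image_id, retrieval_score, spotted_texts)."""
    def text_score(texts):
        if not query_texts or not texts:
            return 0.0
        return max(SequenceMatcher(None, q, t).ratio()
                   for q in query_texts for t in texts)
    return sorted(candidates,
                  key=lambda c: (text_score(c[2]), c[1]), reverse=True)

ranked = rerank_by_text(["ROOM 1203"],
                        [("imgA", 0.91, ["R00M 1203"]), ("imgB", 0.93, ["EXIT"])])
```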
https://arxiv.org/abs/2503.06501
During the steel billet production process, it is essential to recognize machine-printed or manually written billet numbers on moving billets in real-time. To address the issue of low recognition accuracy for existing scene text recognition methods, caused by factors such as image distortions and distribution differences between training and test data, we propose a billet number recognition method that integrates test-time adaptation with prior knowledge. First, we introduce a test-time adaptation method into a model that uses the DB network for text detection and the SVTR network for text recognition. By minimizing the model's entropy during the testing phase, the model can adapt to the distribution of test data without the need for supervised fine-tuning. Second, we leverage the billet number encoding rules as prior knowledge to assess the validity of each recognition result. Invalid results, which do not comply with the encoding rules, are replaced. Finally, we introduce a validation mechanism into the CTC algorithm using prior knowledge to address its limitations in recognizing damaged characters. Experimental results on real datasets, including both machine-printed billet numbers and handwritten billet numbers, show significant improvements in evaluation metrics, validating the effectiveness of the proposed method.
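Two of the ingredients above, entropy-minimization test-time adaptation and an encoding-rule validity check, can be sketched as follows; the recognizer's output shape and the regex are assumed placeholders (the real billet-number rule is not reproduced).

```python
# (1) Entropy-minimization TTA step; (2) rule-based validity check.
import re
import torch

def tta_entropy_step(model, optimizer, image_batch):
    probs = torch.softmax(model(image_batch), dim=-1)            # (B, T, C) assumed
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1).mean()  # prediction entropy
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

def is_valid_billet_number(text, pattern=r"^[A-Z]\d{7}$"):       # placeholder rule
    return re.fullmatch(pattern, text) is not None
```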
https://arxiv.org/abs/2502.09026
Mainstream Scene Text Recognition (STR) algorithms are developed for RGB cameras, which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280 * 720) event samples and involves both Chinese and English characters. We also benchmark multiple STR algorithms as baselines for future works to compare against. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts event features using a visual encoder and projects them into tokens using a Q-former module. More importantly, we propose to augment the vision tokens with a memory mechanism before feeding them into the large language model. A similarity-based error correction mechanism is embedded within the large language model to correct potential minor errors based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulated STR datasets fully demonstrate the effectiveness of our proposed model. We believe the dataset and algorithmic model establish a novel event-based STR task and are expected to accelerate the application of event cameras in various industries. The source code and pre-trained models will be released on this https URL.
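The similarity-based correction can be illustrated by snapping a recognized string to the closest entry of a known vocabulary; this stands in for the correction performed inside the large language model, and the vocabulary and cutoff are illustrative.

```python
# Snap a noisy prediction to its closest vocabulary entry.
from difflib import get_close_matches

def correct_prediction(pred, vocabulary, cutoff=0.8):
    match = get_close_matches(pred, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else pred

print(correct_prediction("ENTRAMCE", ["ENTRANCE", "EXIT", "PARKING"]))  # -> "ENTRANCE"
```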
https://arxiv.org/abs/2502.09020
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
https://arxiv.org/abs/2502.07411
Designing datasets for Visual Question Answering (VQA) is a difficult and complex task that requires NLP for parsing the question and computer vision for analysing the aspects of the image relevant to answering it. Several benchmark datasets have been developed by researchers, but there are many issues with using them for methodical performance tests. This paper proposes a new benchmark dataset -- a pilot version called VQA-Levels is ready now -- for testing VQA systems systematically and assisting researchers in advancing the field. The questions are classified into seven levels, ranging from direct answers based on low-level image features (without needing even a classifier) to those requiring high-level abstraction of the entire image content. The questions in the dataset exhibit one or more of ten properties, and each is categorised into a specific level from 1 to 7. Levels 1-3 bear directly on the visual content, while the remaining levels require extra knowledge about the objects in the image. Each question generally has a unique one- or two-word answer. The questions are 'natural' in the sense that a human is likely to ask such a question when seeing the image. An example question at Level 1 is, ``What is the shape of the red colored region in the image?", while at Level 7 it is, ``Why is the man cutting the paper?". Initial testing of the proposed dataset on some existing VQA systems reveals that their success is highest on Level 1 (low-level features) and Level 2 (object classification) questions and lowest on Level 3 (scene text) questions, followed by Level 6 (extrapolation) and Level 7 (whole scene analysis). The work in this paper will go a long way toward systematically analyzing VQA systems.
https://arxiv.org/abs/2502.02951