Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM that attend more strongly to scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
https://arxiv.org/abs/2506.05551
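As a rough illustration of the layer-selection idea behind Grounded Layer Correction, the sketch below scores each LLM layer by how much attention mass it places on the image tokens covering the detected text region and picks the most text-focused layer. The function name, tensor layout, and the use of the last query position are assumptions for the example, not details from the paper.

```python
import torch

def select_grounded_layer(attentions, text_token_mask):
    """Pick the layer whose attention concentrates most on scene-text tokens.

    attentions: list of [heads, seq_len, seq_len] tensors, one per layer
                (row = query position, column = attended position).
    text_token_mask: bool tensor [seq_len], True for visual tokens that
                     overlap the zoomed-in text region (hypothetical input).
    """
    scores = []
    for layer_attn in attentions:
        # Attention mass that the final (generated) token places on
        # text-region tokens, averaged over heads.
        last_query = layer_attn[:, -1, :]                 # [heads, seq_len]
        mass = last_query[:, text_token_mask].sum(-1).mean()
        scores.append(mass)
    return int(torch.stack(scores).argmax())

if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, heads, layers = 16, 4, 6
    attns = [torch.softmax(torch.randn(heads, seq_len, seq_len), dim=-1) for _ in range(layers)]
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[3:8] = True                                      # tokens covering the text crop
    print("layer most focused on text region:", select_grounded_layer(attns, mask))
```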
Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on the existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% in accuracy and also provides faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.
https://arxiv.org/abs/2506.04999
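The abstract does not spell out CSTR-CLIP's multi-granularity alignment objective; below is a minimal sketch of one plausible form, assuming a CLIP-style symmetric InfoNCE loss applied at a coarse (text-line) and a fine (character) granularity and summed. All names, the pooled character embeddings, and the weighting scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_granularity_loss(img_global, txt_line, img_char, txt_char, w_char=0.5):
    """Coarse line-level alignment plus a finer character-level alignment term."""
    return info_nce(img_global, txt_line) + w_char * info_nce(img_char, txt_char)

if __name__ == "__main__":
    b, d = 8, 256
    loss = multi_granularity_loss(torch.randn(b, d), torch.randn(b, d),
                                  torch.randn(b, d), torch.randn(b, d))
    print(loss.item())
```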
Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.
https://arxiv.org/abs/2506.04983
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
https://arxiv.org/abs/2505.23119
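A hedged sketch of the conditioning idea in TextSR: OCR output is encoded as UTF-8 bytes (so a single vocabulary covers all languages) and injected into image features through cross-attention. The module name `Utf8TextConditioner`, the residual fusion, and all sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Utf8TextConditioner(nn.Module):
    """Embed OCR text as UTF-8 bytes and inject it into image features via cross-attention."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.byte_embed = nn.Embedding(256, dim)          # one entry per possible byte value
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens, ocr_strings):
        # image_tokens: [batch, num_patches, dim]
        byte_ids = [torch.tensor(list(s.encode("utf-8")), dtype=torch.long) for s in ocr_strings]
        byte_ids = nn.utils.rnn.pad_sequence(byte_ids, batch_first=True)      # [batch, max_len]
        text_tokens = self.byte_embed(byte_ids.to(image_tokens.device))       # [batch, max_len, dim]
        attended, _ = self.cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return image_tokens + attended                     # residual, text-conditioned features

if __name__ == "__main__":
    cond = Utf8TextConditioner()
    feats = torch.randn(2, 64, 256)
    out = cond(feats, ["CAFÉ", "北京路"])                  # multilingual strings share one byte vocabulary
    print(out.shape)                                       # torch.Size([2, 64, 256])
```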
This study aims to investigate the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches continue to rely on 2D data, sourcing authentic training examples from movie posters and book covers, which limits their ability to capture the complex interactions between spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the necessary geometric cues for accurately embedding text into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristics. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advancements in text rendering under complex 3D-spatial conditions.
https://arxiv.org/abs/2505.18479
Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.
https://arxiv.org/abs/2505.17778
Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a \textit{Fine-tuning Gap}, where models leverage the \textit{Dataset-Specific Optimization} (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performance on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a \textit{Joint-Dataset Learning} (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at this https URL.
https://arxiv.org/abs/2505.15649
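A minimal sketch of what a Joint-Dataset Learning setup could look like in practice, assuming PyTorch datasets for each benchmark: the detector trains on one pooled stream across domains instead of being fine-tuned per benchmark (the DSO paradigm). The tensor stand-ins are placeholders for real image/annotation pairs.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for individual benchmark datasets; real datasets would yield
# (image, polygon annotations) pairs rather than bare tensors.
ds_a = TensorDataset(torch.randn(100, 3, 64, 64))
ds_b = TensorDataset(torch.randn(400, 3, 64, 64))
ds_c = TensorDataset(torch.randn(250, 3, 64, 64))

# Joint-Dataset Learning: one pooled training stream across domains, rather than
# Dataset-Specific Optimization (per-benchmark fine-tuning).
joint = ConcatDataset([ds_a, ds_b, ds_c])
loader = DataLoader(joint, batch_size=16, shuffle=True)

for (images,) in loader:
    # forward/backward pass of the shared detector would go here
    break
print("joint pool size:", len(joint))
```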
Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02\% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28\%.
https://arxiv.org/abs/2505.04915
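A simplified sketch of the cross-level interaction described for GlyphMastero, assuming character-level and line-level OCR features are available: local glyph tokens attend to global text-line context via cross-attention. The layer sizes, the residual/LayerNorm fusion, and the class name are assumptions, and the feature pyramid fusion is not reproduced.

```python
import torch
import torch.nn as nn

class GlyphAttention(nn.Module):
    """Cross-level interaction: character-level glyph features attend to line-level context."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, char_feats, line_feats):
        # char_feats: [batch, num_chars, dim]   local, per-character OCR features
        # line_feats: [batch, num_lines, dim]   global, per-text-line OCR features
        fused, _ = self.attn(query=char_feats, key=line_feats, value=line_feats)
        return self.norm(char_feats + fused)    # glyph-aware guidance for the diffusion model

if __name__ == "__main__":
    ga = GlyphAttention()
    out = ga(torch.randn(2, 12, 256), torch.randn(2, 3, 256))
    print(out.shape)    # torch.Size([2, 12, 256])
```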
The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (\eg, Chinese), which have complex glyph structures. To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing framework based on FLUX-Fill. Specifically, we carefully investigate glyph conditioning, considering both visual and textual modalities. To retain the original generative capabilities of FLUX-Fill while enhancing its understanding and generation of glyphs, we propose lightweight glyph and text embedding modules. Owing to the lightweight design, FLUX-Text is trained with only $100K$ training examples, compared to current popular methods trained with 2.9M. With no bells and whistles, our method achieves state-of-the-art performance on text editing tasks. Qualitative and quantitative experiments on the public datasets demonstrate that our method surpasses previous works in text fidelity.
https://arxiv.org/abs/2505.03329
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. However, their robustness to common image corruptions remains under-explored. In this work, we present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark, spanning four categories: noise, blur, weather, and digital distortions. We introduce two new benchmarks, TextVQA-C and GQA-C, to systematically evaluate how corruptions affect scene text understanding and object-based reasoning, respectively. Our analysis reveals that transformer-based VLMs exhibit distinct vulnerability patterns across tasks: text recognition deteriorates most severely under blur and snow corruptions, while object reasoning shows higher sensitivity to corruptions such as frost and impulse noise. We connect these observations to the frequency-domain characteristics of different corruptions, revealing how transformers' inherent bias toward low-frequency processing explains their differential robustness patterns. Our findings provide valuable insights for developing more corruption-robust vision-language models for real-world applications.
https://arxiv.org/abs/2504.13690
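To make the frequency-domain argument concrete, the sketch below computes the fraction of a corruption residual's energy (corrupted minus clean image) that falls inside a low-frequency disk of the 2D FFT; corruptions whose residual energy sits mostly in high frequencies would interact differently with transformers' low-frequency bias. The cutoff ratio and the toy corruptions are illustrative only.

```python
import numpy as np

def low_frequency_fraction(residual, cutoff_ratio=0.1):
    """Fraction of a corruption residual's spectral energy inside a low-frequency disk."""
    spectrum = np.fft.fftshift(np.fft.fft2(residual))
    h, w = residual.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    low_mask = np.sqrt(yy ** 2 + xx ** 2) <= cutoff_ratio * min(h, w)
    energy = np.abs(spectrum) ** 2
    return energy[low_mask].sum() / energy.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((128, 128))

    brightness = clean + 0.2                        # residual is a constant offset: almost purely low-frequency
    impulse = clean.copy()
    mask = rng.random(clean.shape) < 0.05
    impulse[mask] = rng.integers(0, 2, mask.sum())  # scattered spikes: energy spread across high frequencies

    for name, corrupted in [("brightness", brightness), ("impulse noise", impulse)]:
        print(f"{name}: low-frequency energy fraction = {low_frequency_fraction(corrupted - clean):.3f}")
```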
Supervised learning has recently developed rapidly in scene text segmentation. However, the lack of high-quality datasets and the high cost of pixel annotation greatly limit its development. Considering the strong performance of few-shot learning methods on downstream tasks, we investigate the application of few-shot learning to scene text segmentation. We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable adaptive prompts to capture distinctive text features and complex background distributions, we propose an Adaptive Feature Alignment (AFA) module. By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only a few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and shows great potential in text-related domains.
https://arxiv.org/abs/2504.11164
Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervision caused by inconsistency between teacher and student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS, which leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters.
https://arxiv.org/abs/2504.09966
Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via a box-to-polygon strategy or decompose a contour into pieces for gradual fitting; the deficiency of coarse contours or complex pipelines always exists in these models. Considering the above issues, we introduce EdgeText to fit text contours compactly while alleviating excessive contour rebuilding processes. Concretely, it is observed that the two long edges of texts can be regarded as smooth curves. This allows us to build contours via continuous and smooth edges that cover text regions tightly instead of fitting piecewise, which helps avoid the two limitations in current models. Inspired by this observation, EdgeText formulates the text representation as an edge approximation problem via parameterized curve fitting functions. In the inference stage, our model starts with locating text centers, and then creates curve functions for approximating text edges relying on these points. Meanwhile, truncation points are determined based on the location features. In the end, curve segments are extracted from the curve functions using the pixel coordinate information brought by the truncation points to reconstruct text contours. Furthermore, considering the deep dependency of EdgeText on text edges, a bilateral enhanced perception (BEP) module is designed. It encourages our model to pay attention to the recognition of edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) to force the proposed model to focus on the curve distribution and avoid being disturbed by text scales.
https://arxiv.org/abs/2504.04001
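A rough sketch of the edge-as-smooth-curve idea: sampled points on the two long edges are fitted with polynomial curve functions, and the contour is rebuilt by evaluating both curves between the truncation points. The polynomial degree, the y = f(x) parameterization, and the function names are assumptions; the BEP module and PI-loss are not reproduced.

```python
import numpy as np

def fit_edge_curve(edge_points, degree=3):
    """Fit a smooth polynomial y = f(x) to sampled points along one long text edge."""
    xs, ys = edge_points[:, 0], edge_points[:, 1]
    return np.poly1d(np.polyfit(xs, ys, degree))

def reconstruct_contour(top_points, bottom_points, x_start, x_end, num=20):
    """Rebuild a closed text contour from the two fitted edge curves, clipped at
    the truncation points x_start / x_end (left and right ends of the text)."""
    top, bottom = fit_edge_curve(top_points), fit_edge_curve(bottom_points)
    xs = np.linspace(x_start, x_end, num)
    upper = np.stack([xs, top(xs)], axis=1)
    lower = np.stack([xs[::-1], bottom(xs[::-1])], axis=1)
    return np.concatenate([upper, lower], axis=0)    # polygon: along the top edge, back along the bottom

if __name__ == "__main__":
    xs = np.linspace(0, 100, 15)
    top_pts = np.stack([xs, 10 + 0.02 * (xs - 50) ** 2], axis=1)   # curved upper edge samples
    bot_pts = np.stack([xs, 30 + 0.02 * (xs - 50) ** 2], axis=1)   # curved lower edge samples
    contour = reconstruct_contour(top_pts, bot_pts, x_start=0, x_end=100)
    print(contour.shape)    # (40, 2)
```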
Scene text image super-resolution (STISR) enhances the resolution and quality of low-resolution images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) Explicit categorical priors, like TP, can negatively impact STISR if incorrect. We reveal that these explicit priors are unstable and propose replacing them with a Non-CAtegorical Prior (NCAP) built from penultimate layer representations. (2) Pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose a method to mitigate it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement of 3.5%, while our method significantly enhances generalization performance by 14.8% across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.
https://arxiv.org/abs/2504.00410
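A minimal sketch of the hard/soft label mixing used to temper overconfidence in the prior modality, assuming a standard distillation-style blend of cross-entropy on hard labels and temperature-scaled KL to the recognizer's soft distribution. The weighting, temperature, and class count are illustrative; NCAP itself (passing penultimate-layer features instead of categorical predictions) is not shown.

```python
import torch
import torch.nn.functional as F

def mixed_label_loss(student_logits, teacher_logits, hard_targets, alpha=0.5, temperature=2.0):
    """Blend hard-label cross-entropy with soft-label KL to temper overconfident priors.

    student_logits / teacher_logits: [batch, num_classes]
    hard_targets: [batch] integer class ids
    alpha: weight on hard labels (1.0 recovers plain CE, 0.0 is pure distillation).
    """
    hard = F.cross_entropy(student_logits, hard_targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft

if __name__ == "__main__":
    logits_s, logits_t = torch.randn(4, 37), torch.randn(4, 37)   # 37-way character classes, illustrative
    targets = torch.randint(0, 37, (4,))
    print(mixed_label_loss(logits_s, logits_t, targets).item())
```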
Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName~significantly outperforms SD3.5 Large~\cite{sd3} and GPT4o~\cite{gpt4o} with DALL-E 3~\cite{dalle3} in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName~opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.
https://arxiv.org/abs/2503.20198
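The paper's text-focused binary tokenizer is only described at a high level; the sketch below shows the generic ingredient such a tokenizer would likely need, binarizing latent features to ±1 codes with a straight-through estimator so gradients can flow through the quantization step. This is an assumption-level illustration, not the actual tokenizer design.

```python
import torch
import torch.nn as nn

class BinaryQuantizer(nn.Module):
    """Quantize latent features to ±1 codes with a straight-through gradient estimator."""

    def forward(self, latents):
        hard = torch.where(latents >= 0, torch.ones_like(latents), -torch.ones_like(latents))
        # Forward pass outputs the binary codes; backward pass copies gradients to `latents`.
        return latents + (hard - latents).detach()

if __name__ == "__main__":
    quant = BinaryQuantizer()
    z = torch.randn(2, 8, requires_grad=True)
    codes = quant(z)
    codes.sum().backward()
    print(codes[0], z.grad[0])    # codes are ±1, gradients pass straight through as ones
```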
In recent years, vision transformers with a text decoder have demonstrated remarkable performance on Scene Text Recognition (STR) due to their ability to capture long-range dependencies and contextual relationships with high learning capacity. However, the computational and memory demands of these models are significant, limiting their deployment in resource-constrained applications. To address this challenge, we propose an efficient and accurate STR system. Specifically, we focus on improving the efficiency of encoder models by introducing a cascaded-transformers structure. This structure progressively reduces the vision token size during the encoding step, effectively eliminating redundant tokens and reducing computational cost. Our experimental results confirm that our STR system achieves comparable performance to state-of-the-art baselines while substantially decreasing computational requirements. In particular, for large models, the accuracy remains nearly the same (92.77 vs. 92.68), while computational complexity is almost halved with our structure.
https://arxiv.org/abs/2503.18883
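A hedged sketch of the cascaded-transformers idea: between encoder stages the vision-token count is halved, here by average-pooling neighbouring tokens, so later stages operate on fewer tokens. The merging rule, stage depths, and dimensions are assumptions; the paper's exact reduction scheme may differ.

```python
import torch
import torch.nn as nn

class CascadedEncoder(nn.Module):
    """Stack of transformer stages that halves the vision-token count after each stage."""

    def __init__(self, dim=192, depth_per_stage=2, num_stages=3, num_heads=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True),
                num_layers=depth_per_stage,
            )
            for _ in range(num_stages)
        )

    @staticmethod
    def reduce_tokens(x):
        # Merge neighbouring tokens by average pooling: [B, N, D] -> [B, N // 2, D].
        b, n, d = x.shape
        return x[:, : n - n % 2].reshape(b, n // 2, 2, d).mean(dim=2)

    def forward(self, tokens):
        for i, stage in enumerate(self.stages):
            tokens = stage(tokens)
            if i < len(self.stages) - 1:          # keep the final stage's resolution
                tokens = self.reduce_tokens(tokens)
        return tokens

if __name__ == "__main__":
    enc = CascadedEncoder()
    out = enc(torch.randn(2, 128, 192))
    print(out.shape)    # torch.Size([2, 32, 192]) -- 128 -> 64 -> 32 tokens
```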
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
https://arxiv.org/abs/2503.18746
Scaling architectures have been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.
https://arxiv.org/abs/2503.16184
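The decoder enhancement is named as differential cross-attention; below is a simplified single-head sketch of that general idea (two independently projected attention maps whose weighted difference cancels attention noise common to both). The projections, the single learnable lambda, and the head layout are assumptions, and the cloze self-distillation procedure itself is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialCrossAttention(nn.Module):
    """Simplified single-head differential cross-attention sketch."""

    def __init__(self, dim=256):
        super().__init__()
        self.q1, self.q2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k1, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))   # learnable weight on the second map
        self.scale = dim ** -0.5

    def forward(self, queries, memory):
        # queries: [B, T, D] decoder states; memory: [B, N, D] vision-encoder tokens
        a1 = F.softmax(self.q1(queries) @ self.k1(memory).transpose(1, 2) * self.scale, dim=-1)
        a2 = F.softmax(self.q2(queries) @ self.k2(memory).transpose(1, 2) * self.scale, dim=-1)
        return (a1 - self.lam * a2) @ self.v(memory)

if __name__ == "__main__":
    attn = DifferentialCrossAttention()
    out = attn(torch.randn(2, 10, 256), torch.randn(2, 64, 256))
    print(out.shape)    # torch.Size([2, 10, 256])
```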
Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection, the framework follows a block-level comparison between the feature map and the source image and harnesses contextual information using pretrained captioners, allowing it to generate word predictions directly from the scene. The predicted texts are then semantically and lexically evaluated to obtain a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier end-to-end STR pipeline, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.
https://arxiv.org/abs/2503.15639
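A small sketch of the confidence-gated routing described in the abstract: a lightweight, context-based pass proposes word predictions with semantic/lexical scores, and only images whose predictions fall below the threshold are sent through the heavy end-to-end spotter. Both callables and the dataclass are placeholders for the actual components.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WordPrediction:
    text: str
    score: float      # combined semantic + lexical confidence in [0, 1]

def recognize_with_bypass(
    image,
    light_pass: Callable[[object], List[WordPrediction]],
    heavy_spotter: Callable[[object], List[str]],
    threshold: float = 0.8,
) -> List[str]:
    """Accept confident predictions from the lightweight pass; otherwise fall back
    to the expensive end-to-end spotter. Both callables are placeholders."""
    candidates = light_pass(image)
    if candidates and all(p.score >= threshold for p in candidates):
        return [p.text for p in candidates]    # skip the heavy model entirely
    return heavy_spotter(image)                # low confidence: run the full STR pipeline

if __name__ == "__main__":
    fake_light = lambda img: [WordPrediction("EXIT", 0.93), WordPrediction("OPEN", 0.88)]
    fake_heavy = lambda img: ["EXIT", "OPEN"]
    print(recognize_with_bypass(object(), fake_light, fake_heavy))   # ['EXIT', 'OPEN']
```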
Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, \mymodel achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.
https://arxiv.org/abs/2503.08387