Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing style: (i) Dense Scene Text Visual Question Answering (STVQA), which requires reasoning over multiple pieces of visual text evidence; (ii) Receipt Key Information Extraction (KIE), which tests layout-aware structured extraction from mobile-captured receipts; and (iii) Handwriting OCR, which evaluates page-level transcription across various media and writing directions. We evaluate 14 open-weight VLMs and find that the best model achieves an average score of 0.64 across the three tasks. Error analyses show recognition remains the dominant bottleneck, especially for kanji. JaWildText enables fine-grained, script-aware diagnosis of Japanese scene text capabilities, and will be released with evaluation code.
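The abstract reports a single average over three heterogeneous tasks but does not spell out the per-task metrics here. As a hedged illustration only, the sketch below scores each example with a character-level normalized edit distance and macro-averages per task and then across tasks; every name in it is hypothetical rather than JaWildText's released evaluation code.

```python
# Hedged sketch: the per-task metrics are not given in the abstract, so this
# illustrates one common convention -- a character-level normalized
# edit-distance score per example, averaged per task and then across tasks.
# All names are hypothetical, not JaWildText's released evaluation code.

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def ned_score(pred: str, gold: str) -> float:
    """1 - normalized edit distance, clipped to [0, 1]."""
    denom = max(len(pred), len(gold), 1)
    return max(0.0, 1.0 - edit_distance(pred, gold) / denom)

def average_benchmark_score(task_results: dict) -> float:
    """Macro-average: mean example score per task, then mean over tasks."""
    task_means = []
    for task, pairs in task_results.items():
        task_means.append(sum(ned_score(p, g) for p, g in pairs) / len(pairs))
    return sum(task_means) / len(task_means)

if __name__ == "__main__":
    results = {
        "stvqa": [("東京駅", "東京駅"), ("営業中", "営業中止")],
        "receipt_kie": [("¥1,200", "¥1,200")],
        "handwriting_ocr": [("こんにちは", "こんにちわ")],
    }
    print(f"average score: {average_benchmark_score(results):.2f}")
```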
https://arxiv.org/abs/2603.27942
Previous works based on the Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, the typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address these issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps from which a few foreground points are obtained, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, datasets with multi-level, word-level-only, and line-level-only annotations are combined in parallel into a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and the hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves the F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.
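The point decoder's role is to replace thousands of segmentation-sampled point prompts with a handful of peaks read off a word heatmap. A minimal sketch of that peak-extraction step is below, using max-pool-based local-maximum detection and top-k selection; ET-SAM's actual decoder and post-processing may differ.

```python
# Hedged sketch of extracting a few foreground point prompts from a word
# heatmap, as the abstract describes. Peak picking via max-pool NMS is one
# standard way to do this; ET-SAM's actual decoder may differ.
import torch
import torch.nn.functional as F

def heatmap_to_point_prompts(heatmap: torch.Tensor, k: int = 10, thresh: float = 0.5):
    """heatmap: (H, W) word-center scores in [0, 1]. Returns up to k (x, y) peaks."""
    h = heatmap[None, None]                      # (1, 1, H, W)
    pooled = F.max_pool2d(h, kernel_size=3, stride=1, padding=1)
    peaks = (h == pooled) & (h > thresh)         # local maxima above threshold
    scores = torch.where(peaks, h, torch.zeros_like(h)).flatten()
    topv, topi = scores.topk(min(k, int(peaks.sum())))
    ys, xs = topi // heatmap.shape[1], topi % heatmap.shape[1]
    return torch.stack([xs, ys], dim=1), topv    # (N, 2) point prompts and their scores

if __name__ == "__main__":
    hm = torch.zeros(64, 64)
    hm[10, 20], hm[40, 50] = 0.9, 0.8            # two synthetic word centers
    pts, conf = heatmap_to_point_prompts(hm, k=5)
    print(pts.tolist(), conf.tolist())
```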
https://arxiv.org/abs/2603.25168
Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at this https URL
https://arxiv.org/abs/2603.24571
The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and ``reference-instance'' alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.
https://arxiv.org/abs/2603.19121
Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
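Since the abstract defines R-GDPO only at a high level, the sketch below starts from the standard DPO objective and restricts the log-likelihood sums to an annotated region mask, which is one plausible reading of optimizing preferences over annotated regions; the paper's exact inter- and intra-sample grouping may differ.

```python
# Hedged sketch of a region-restricted DPO-style objective. The standard DPO
# loss compares whole-sample log-likelihood margins; here the sums are masked
# to an annotated glyph region, which is one plausible reading of
# "Region-Grouped DPO" -- the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def region_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    region_mask_w, region_mask_l, beta=0.1):
    """
    logp_*:        (B, T) per-token log-probs under the policy for the preferred (w)
                   and dispreferred (l) sample.
    ref_logp_*:    same, under the frozen reference model.
    region_mask_*: (B, T) 1.0 inside the annotated glyph region, 0.0 elsewhere.
    """
    def masked_sum(lp, m):
        return (lp * m).sum(dim=-1)

    margin_w = masked_sum(logp_w, region_mask_w) - masked_sum(ref_logp_w, region_mask_w)
    margin_l = masked_sum(logp_l, region_mask_l) - masked_sum(ref_logp_l, region_mask_l)
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()

if __name__ == "__main__":
    B, T = 2, 8
    lp = lambda: -torch.rand(B, T)                  # toy per-token log-probs
    mask = torch.zeros(B, T)
    mask[:, 2:5] = 1.0                              # pretend tokens 2..4 are the region
    print(region_dpo_loss(lp(), lp(), lp(), lp(), mask, mask).item())
```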
https://arxiv.org/abs/2603.15616
Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
https://arxiv.org/abs/2603.15409
Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.
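For the continuous half of the dual objective, a minimal Conditional Flow Matching training step looks as follows; the model signature and low-resolution conditioning are assumptions for illustration, not DualTSR's actual interface.

```python
# Hedged sketch of a Conditional Flow Matching training step, the continuous
# half of DualTSR's dual diffusion objective as described in the abstract.
# The model interface (LR-image conditioning, text branch) is assumed.
import torch

def cfm_loss(model, hr_image: torch.Tensor, lr_image: torch.Tensor) -> torch.Tensor:
    """Linear-path flow matching: x_t = (1 - t) * noise + t * x1, target velocity x1 - noise."""
    b = hr_image.shape[0]
    t = torch.rand(b, 1, 1, 1, device=hr_image.device)   # one timestep per sample
    noise = torch.randn_like(hr_image)
    x_t = (1.0 - t) * noise + t * hr_image
    target_v = hr_image - noise
    pred_v = model(x_t, t.flatten(), lr_image)            # assumed signature
    return torch.mean((pred_v - target_v) ** 2)

if __name__ == "__main__":
    toy = lambda x, t, cond: x * 0.0                       # stand-in "model"
    hr, lr = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 16, 16)
    print(cfm_loss(toy, hr, lr).item())
```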
https://arxiv.org/abs/2603.14207
Scene text recognition (STR) methods have demonstrated their excellent capability in English text images. However, due to the complex inner structures of Chinese and the extensive character categories, it poses challenges for recognizing Chinese text in images. Recently, studies have shown that the methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply the model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character's position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER. Code is available at this https URL.
https://arxiv.org/abs/2603.13886
Ultra-low bitrate image compression faces a critical challenge: preserving small-font scene text while maintaining overall visual quality. Region-of-interest (ROI) bit allocation can prioritize text but often degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. Instead of relying on ROI coding, we incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Our method, TextBoost, operationalizes this idea through three strategic designs: (i) adaptively filtering OCR outputs and rendering them into a guidance map; (ii) integrating this guidance with decoder features in a calibrated manner via an attention-guided fusion block; and (iii) enforcing guidance-consistent reconstruction in text regions with a regularizing loss that promotes natural blending with the scene. Extensive experiments on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp), producing sharper small-font text while preserving global image quality and effectively decoupling text enhancement from global rate-distortion optimization.
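Step (i) is mechanically simple: drop unreliable or degenerate OCR detections and rasterize the rest into a guidance map the decoder can attend to. The sketch below is a hedged illustration; the paper's filtering rules and map encoding may differ.

```python
# Hedged sketch of turning OCR detections into a guidance map, as in the
# abstract's step (i): filter OCR outputs, then render them into a map.
# The concrete thresholds and map encoding are illustrative guesses.
import numpy as np

def ocr_to_guidance_map(ocr_results, img_h, img_w, min_conf=0.5, min_side=4):
    """
    ocr_results: list of dicts {"box": (x0, y0, x1, y1), "text": str, "conf": float}.
    Returns a (img_h, img_w) float map; kept text regions are filled with the
    detection confidence so the decoder sees where (and how reliably) text was read.
    """
    gmap = np.zeros((img_h, img_w), dtype=np.float32)
    for det in ocr_results:
        x0, y0, x1, y1 = det["box"]
        if det["conf"] < min_conf:
            continue                       # drop unreliable detections
        if (x1 - x0) < min_side or (y1 - y0) < min_side:
            continue                       # drop degenerate boxes
        gmap[y0:y1, x0:x1] = np.maximum(gmap[y0:y1, x0:x1], det["conf"])
    return gmap

if __name__ == "__main__":
    dets = [{"box": (10, 10, 60, 24), "text": "EXIT", "conf": 0.92},
            {"box": (5, 40, 7, 42), "text": "~", "conf": 0.95}]   # too small, filtered
    print(ocr_to_guidance_map(dets, 64, 64).sum())
```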
https://arxiv.org/abs/2603.04115
Scene text recognition (STR) and handwritten text recognition (HTR) face significant challenges in accurately transcribing textual content from images into machine-readable formats. Conventional OCR models often predict transcriptions directly, which limits detailed reasoning about text structure. We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency, with answers derived from ground-truth text. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions. Experiments on WordArt and Esposalles datasets show consistent improvements over baseline models, with significant reductions in both CER and WER. Our code is publicly available at this https URL.
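Because the auxiliary questions are derived directly from the ground-truth transcription, the generation step can be shown concretely; the sketch below produces presence, position, and frequency questions, though the exact templates used in the paper may differ.

```python
# Hedged sketch of generating the character-level QA pairs the abstract
# describes (presence, position, frequency) from a ground-truth transcription.
# The question templates are illustrative, not the paper's exact wording.
import random

def make_char_qa(gt_text: str, n_per_type: int = 1, seed: int = 0):
    rng = random.Random(seed)
    chars = list(gt_text)
    qa = []
    for _ in range(n_per_type):
        c = rng.choice(chars)
        qa.append((f'Does the character "{c}" appear in the text?', "yes"))
        absent = next((x for x in "abcdefghijklmnopqrstuvwxyz" if x not in gt_text), None)
        if absent:
            qa.append((f'Does the character "{absent}" appear in the text?', "no"))
        idx = rng.randrange(len(chars))
        qa.append((f"What is the character at position {idx + 1}?", chars[idx]))
        qa.append((f'How many times does "{c}" occur?', str(gt_text.count(c))))
    return qa

if __name__ == "__main__":
    for q, a in make_char_qa("wedding"):
        print(f"Q: {q}  A: {a}")
```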
https://arxiv.org/abs/2603.03580
Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While printed document text recognition has advanced thanks to available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training a separate model for each modality precludes cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models incurs significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across modalities often degrades performance on underrepresented modalities. To address these issues, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features according to the modality of a particular input image and to enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models can be accessed via this gated repository\footnote{in review}.
https://arxiv.org/abs/2603.00702
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
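The analysis pipeline itself is straightforward to outline: compute per-layer activation differences between each original image and its text-inpainted counterpart, locate the layer with the largest effect, and run PCA on the pooled difference vectors. The sketch below assumes mean-pooled hidden states per layer and is not tied to any particular model's hooks.

```python
# Hedged sketch of the analysis described in the abstract: per-layer activation
# differences between original and text-inpainted images, the layer of peak
# sensitivity, and a PCA of the pooled difference vectors. Model hooks and
# shapes are assumptions for illustration.
import numpy as np

def ocr_bottleneck_analysis(acts_orig, acts_inpaint):
    """
    acts_*: dict layer_index -> (n_images, d) mean-pooled hidden states.
    Returns (peak layer, per-layer effect sizes, fraction of variance on PC1 at the peak).
    """
    layers = sorted(acts_orig)
    effect = np.array([
        np.linalg.norm(acts_orig[l] - acts_inpaint[l], axis=-1).mean() for l in layers
    ])
    peak = layers[int(effect.argmax())]

    diffs = acts_orig[peak] - acts_inpaint[peak]          # (n, d) OCR difference vectors
    diffs = diffs - diffs.mean(axis=0, keepdims=True)
    _, s, _ = np.linalg.svd(diffs, full_matrices=False)
    pc1_var = (s[0] ** 2) / (s ** 2).sum()                # variance explained by PC1
    return peak, effect, pc1_var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts_o = {l: rng.normal(size=(32, 16)) for l in range(4)}
    acts_i = {l: acts_o[l] + (0.5 if l == 2 else 0.05) * rng.normal(size=(32, 16))
              for l in range(4)}
    peak, _, pc1 = ocr_bottleneck_analysis(acts_o, acts_i)
    print(peak, round(float(pc1), 3))
```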
https://arxiv.org/abs/2602.22918
Large-scale and category-balanced text data is essential for training effective Scene Text Recognition (STR) models, yet is hard to obtain when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulation of challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real-data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets and even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance after seeing only 9% of real data labels.
https://arxiv.org/abs/2602.06450
Scene text spotting aims to detect and recognize text in real-world images, where instances are often short, fragmented, or visually ambiguous. Existing methods primarily rely on visual cues and implicitly capture local character dependencies, but they overlook the benefits of external linguistic knowledge. Prior attempts to integrate language models either adapt language modeling objectives without external knowledge or apply pretrained models that are misaligned with the word-level granularity of scene text. We propose TiCLS, an end-to-end text spotter that explicitly incorporates external linguistic knowledge from a character-level pretrained language model. TiCLS introduces a linguistic decoder that fuses visual and linguistic features, yet can be initialized by a pretrained language model, enabling robust recognition of ambiguous or fragmented text. Experiments on ICDAR 2015 and Total-Text demonstrate that TiCLS achieves state-of-the-art performance, validating the effectiveness of PLM-guided linguistic integration for scene text spotting.
https://arxiv.org/abs/2602.04030
To pursue an efficient text assembling process, existing methods detect texts via the shrink-mask expansion strategy. However, the shrinking operation loses the visual features of text margins and blurs the difference between foreground and background, which imposes intrinsic limitations on recognizing text features. We address this issue and design the Text-Pass Filter (TPF) for arbitrary-shaped text detection. It segments the whole text directly, which avoids these intrinsic limitations. Notably, unlike previous whole-text-region-based methods, TPF can separate adhesive texts naturally without complex decoding or post-processing, making real-time text detection possible. Concretely, we observe that a band-pass filter allows through components in a specified band of frequencies, called its passband, while blocking components with frequencies above or below this band. This provides a natural idea for extracting whole texts separately. By simulating the band-pass filter, TPF constructs a unique feature-filter pair for each text. In the inference stage, every filter extracts its matched text by passing its pass-feature and blocking other features. Meanwhile, considering that the large aspect ratios of ribbon-like texts make it hard to recognize texts as a whole, a Reinforcement Ensemble Unit (REU) is designed to enhance the feature consistency of the same text and to enlarge the filter's recognition field to help recognize whole texts. Furthermore, a Foreground Prior Unit (FPU) is introduced to encourage TPF to discriminate between foreground and background, which improves the quality of the feature-filter pairs. Experiments demonstrate the effectiveness of REU and FPU and show TPF's superiority.
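One way to picture the feature-filter pair is as a per-instance dynamic filter applied to a shared feature map, passing the matched text and suppressing everything else; the sketch below is an illustrative reading of the band-pass analogy, not TPF's actual formulation.

```python
# Hedged sketch of a "feature-filter pair": each text instance owns a small
# filter that, applied to the shared feature map, passes its own text and
# blocks everything else (analogous to a band-pass filter's passband). This is
# an illustrative dynamic-convolution reading of the abstract, not TPF's code.
import torch

def apply_text_pass_filters(features: torch.Tensor, filters: torch.Tensor) -> torch.Tensor:
    """
    features: (C, H, W) shared feature map.
    filters:  (N, C) one learned filter vector per text instance.
    Returns (N, H, W) per-instance masks in [0, 1].
    """
    logits = torch.einsum("nc,chw->nhw", filters, features)
    return logits.sigmoid()

if __name__ == "__main__":
    feats = torch.randn(8, 32, 32)
    flt = torch.randn(3, 8)          # three text instances
    masks = apply_text_pass_filters(feats, flt)
    print(masks.shape)               # torch.Size([3, 32, 32])
```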
https://arxiv.org/abs/2601.18098
Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, comprising 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapt state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of OCR-enhanced context, with F1-score improvements of up to 209% when OCR text is appended to the questions. Additionally, we propose a multi-agent VQA framework that combines perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding, underscoring the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark that captures real-world scene text characteristics and supports the development and evaluation of OCR-integrated VQA models in Vietnamese.
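Two mechanical pieces of this setup are easy to show: appending the recognized scene text to the question before it reaches the VQA model, and majority voting over the agents' answers. Prompt wording and interfaces in the sketch below are illustrative, not the paper's.

```python
# Hedged sketch of two mechanical pieces described in the abstract: appending
# OCR text to the question (the "OCR-enhanced context"), and majority voting
# over several agents' answers. Prompt wording is illustrative.
from collections import Counter

def build_ocr_question(question: str, ocr_tokens) -> str:
    """Append the scene text read by the OCR model to the question."""
    return f"{question} [OCR] {' '.join(ocr_tokens)}"

def majority_vote(answers) -> str:
    """Pick the most common (case/space-normalized) answer among the agents."""
    norm = [a.strip().lower() for a in answers]
    return Counter(norm).most_common(1)[0][0]

if __name__ == "__main__":
    q = build_ocr_question("What is the shop's phone number?",
                           ["PHỞ", "HÀ", "NỘI", "0243 825 1234"])
    print(q)
    print(majority_vote(["0243 825 1234", "0243 825 1234 ", "unknown"]))
```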
https://arxiv.org/abs/2512.22218
We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios. SSLC can not only change the sequence length but also takes the visual similarity between characters into account during corruption. Our method is the first to successfully detect label errors in real-world scene text datasets while accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.
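A hedged sketch of SSLC is below: labels are corrupted with visually similar substitutions plus insertions and deletions that change the sequence length. The confusion table and corruption rates are toy placeholders, not the ones used in the paper.

```python
# Hedged sketch of Similarity-based Sequence Label Corruption (SSLC): corrupt a
# label with visually-similar substitutions plus insertions/deletions that
# change its length. The confusion table and rates below are toy placeholders.
import random

VISUAL_CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}

def sslc_corrupt(label: str, p_sub=0.15, p_del=0.05, p_ins=0.05, seed=None) -> str:
    rng = random.Random(seed)
    out = []
    for ch in label:
        r = rng.random()
        if r < p_del:
            continue                                    # deletion: shorter label
        if r < p_del + p_ins:
            out.append(rng.choice("0O1lI5S"))           # insertion: longer label
        if rng.random() < p_sub and ch in VISUAL_CONFUSIONS:
            out.append(VISUAL_CONFUSIONS[ch])           # visually similar substitution
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    print(sslc_corrupt("HOTEL 105", seed=3))
```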
https://arxiv.org/abs/2512.14050
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most of them directly rely on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top\_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7\%, 89.2\%, and 78.5\% on CTW1500, TotalText, and ICDAR19ArT, respectively. Code will be made available.
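The explicit Top_k step can be illustrated independently of the Mamba encoder: keep the k largest scores per query and mask the rest before normalization so irrelevant long-range entries cannot dilute the result. How this interacts with the selection mechanism in the paper is not detailed in the abstract.

```python
# Hedged sketch of explicit Top_k selection: keep the k largest scores per
# query and mask the rest before normalization. How this is wired into the
# Mamba-based encoder is not specified in the abstract.
import torch

def topk_masked_softmax(scores: torch.Tensor, k: int) -> torch.Tensor:
    """scores: (B, Q, K) relevance scores. Keeps the top-k per query row."""
    k = min(k, scores.shape[-1])
    kth = scores.topk(k, dim=-1).values[..., -1:]          # k-th largest per row
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return masked.softmax(dim=-1)

if __name__ == "__main__":
    s = torch.randn(1, 2, 6)
    w = topk_masked_softmax(s, k=3)
    print((w > 0).sum(-1))   # exactly 3 nonzero weights per query (barring ties)
```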
https://arxiv.org/abs/2512.06657
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information and delicate local features. Specifically, we design a global-local combination structure with joint global and local losses, and enhance text image features to ensure consistency of text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of image size, which can be transferred to target text images of various sizes. We use affine fusion to place target text images into the editing patch while keeping their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
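The affine fusion step places a rendered target-text image into the editing patch without distorting it: scale uniformly by the limiting dimension and center the result. The sketch below computes that transform only; the subsequent blending with the background is omitted.

```python
# Hedged sketch of the aspect-ratio-preserving affine placement the abstract
# describes: scale the rendered target-text image uniformly to fit the editing
# patch, then center it. The fusion/blending with the background is omitted.
import numpy as np

def fit_affine(text_hw, patch_hw):
    """Return a 2x3 affine matrix mapping text-image pixels into the patch."""
    th, tw = text_hw
    ph, pw = patch_hw
    s = min(ph / th, pw / tw)                 # uniform scale: aspect ratio unchanged
    tx = (pw - s * tw) / 2.0                  # center horizontally
    ty = (ph - s * th) / 2.0                  # center vertically
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty]], dtype=np.float32)

if __name__ == "__main__":
    M = fit_affine(text_hw=(32, 200), patch_hw=(64, 96))
    print(M)   # could be passed to e.g. cv2.warpAffine(text_img, M, (96, 64))
```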
https://arxiv.org/abs/2512.03574
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, we introduce MDMs into the Scene Text Recognition (STR) task for the first time. We show that the vanilla MDM lags behind ARMs in accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: the noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy while maintaining fast inference with only three denoising steps. Code: this https URL.
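The token-replacement noise can be sketched as a small modification to the usual masking step: most corrupted positions become [MASK], but a fraction are replaced with random real tokens so the model also learns to revise confident but wrong visible tokens. Ratios and schedule below are illustrative.

```python
# Hedged sketch of one masked-diffusion noising step with the token-replacement
# noise the abstract describes: most corrupted positions become [MASK], but a
# fraction are replaced with random real tokens. Ratios are illustrative.
import torch

def noise_labels(labels: torch.Tensor, mask_id: int, vocab_size: int,
                 corrupt_ratio: float = 0.5, replace_frac: float = 0.2):
    """labels: (B, T) target character ids. Returns the corrupted input and a loss mask."""
    corrupt = torch.rand_like(labels, dtype=torch.float) < corrupt_ratio
    replace = corrupt & (torch.rand_like(labels, dtype=torch.float) < replace_frac)
    noisy = labels.clone()
    noisy[corrupt] = mask_id                                              # standard mask noise
    noisy[replace] = torch.randint(0, vocab_size, (int(replace.sum()),))  # token replacement
    return noisy, corrupt                                                  # predict all corrupted positions

if __name__ == "__main__":
    y = torch.randint(0, 50, (2, 8))
    noisy, loss_mask = noise_labels(y, mask_id=50, vocab_size=50)
    print(noisy)
    print(loss_mask)
```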
https://arxiv.org/abs/2512.01422