Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
https://arxiv.org/abs/2512.22218
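The majority-voting step in the multi-agent framework described above can be sketched as follows; the answer normalization and tie-breaking policy are illustrative assumptions, not details from the abstract:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent candidate answer across agents.

    Ties break toward the earliest-seen answer (Counter preserves
    insertion order for equal counts in Python 3.7+).
    """
    normalized = [a.strip().lower() for a in answers]  # normalization is an assumption
    winner, _ = Counter(normalized).most_common(1)[0]
    return winner

print(majority_vote(["Pho 24", "pho 24", "Pho Hoa"]))  # -> pho 24
```

Normalizing case and whitespace matters here because OCR-derived answers from different agents often differ only in surface form.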
We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios. SSLC can not only change the sequence length but also takes the visual similarity between characters into account during corruption. Our method is the first to successfully detect label errors in real-world scene text datasets while accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.
https://arxiv.org/abs/2512.14050
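A toy sketch of the SSLC idea: substitute visually similar characters and occasionally insert or delete characters so the sequence length changes. The confusion map and probabilities here are assumptions for illustration; the paper's actual similarity model is not reproduced:

```python
import random

# Toy visual-confusion map (assumption; SSLC derives similarity differently)
CONFUSABLE = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}

def sslc_corrupt(label, p_sub=0.3, p_len=0.1, rng=None):
    """Corrupt a sequence label: visually similar substitutions plus
    occasional deletions/duplications that change the label length."""
    rng = rng or random.Random(0)
    out = []
    for ch in label:
        r = rng.random()
        if r < p_len / 2:
            continue                     # deletion: shortens the label
        if ch in CONFUSABLE and r < p_sub:
            out.append(CONFUSABLE[ch])   # visually similar substitution
        else:
            out.append(ch)
        if rng.random() < p_len / 2:
            out.append(ch)               # duplication: lengthens the label
    return "".join(out)
```

Seeding the RNG keeps corruption reproducible across training runs, which makes it possible to verify which labels were perturbed.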
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most directly rely on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top-k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden-state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively. Code will be made available.
https://arxiv.org/abs/2512.06657
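The explicit Top-k selection described above can be illustrated as a simple gating over sequence positions: keep the k highest-scoring positions and suppress the rest. This is a NumPy sketch of the idea only; in the paper it operates inside the Mamba-based encoder:

```python
import numpy as np

def topk_select(features, scores, k):
    """Keep the k highest-scoring sequence positions; zero out the rest.

    features: (seq_len, dim) array; scores: (seq_len,) relevance scores.
    """
    idx = np.argsort(scores)[-k:]          # indices of the top-k scores
    mask = np.zeros_like(scores)
    mask[idx] = 1.0
    return features * mask[:, None]        # suppress low-relevance positions

feats = np.arange(8, dtype=float).reshape(4, 2)
scores = np.array([0.1, 0.9, 0.3, 0.8])
kept = topk_select(feats, scores, k=2)     # positions 1 and 3 survive
```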
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure and joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use affine fusion to fill target text images into the editing patch while keeping their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
https://arxiv.org/abs/2512.03574
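A minimal stand-in for the aspect-ratio-preserving placement that the affine fusion performs: scale the rendered target text to fit inside the editing patch, then center it. The function name and centering policy are assumptions for illustration, not the paper's exact formulation:

```python
def fit_into_patch(text_w, text_h, patch_w, patch_h):
    """Scale a rendered text image to fit inside an editing patch while
    keeping its aspect ratio, then center it within the patch."""
    scale = min(patch_w / text_w, patch_h / text_h)  # uniform scale preserves aspect ratio
    new_w, new_h = round(text_w * scale), round(text_h * scale)
    off_x = (patch_w - new_w) // 2
    off_y = (patch_h - new_h) // 2
    return scale, (off_x, off_y), (new_w, new_h)

# A 100x20 text image placed into a 60x40 patch keeps its 5:1 aspect ratio.
scale, offset, size = fit_into_patch(100, 20, 60, 40)
```

Because the scale factor is uniform, the rendered text never stretches, which is exactly why affine fusion avoids the distortion that naive resizing to the patch shape would cause.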
Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that the vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: the noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: this https URL.
https://arxiv.org/abs/2512.01422
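The hybrid noising idea above, mask noise plus token-replacement noise, can be sketched on a token sequence as follows; the ratios and vocabulary are illustrative assumptions:

```python
import random

MASK = "[M]"
VOCAB = list("abcdefghijklmnopqrstuvwxyz")

def noise_tokens(tokens, p_mask=0.5, p_replace=0.1, rng=None):
    """Apply hybrid noise: mask most selected positions, but replace a few
    with random tokens so the model learns to revisit unmasked (and
    possibly wrong) predictions instead of trusting them blindly."""
    rng = rng or random.Random(0)
    noised = []
    for t in tokens:
        r = rng.random()
        if r < p_mask:
            noised.append(MASK)                 # standard mask noise
        elif r < p_mask + p_replace:
            noised.append(rng.choice(VOCAB))    # token-replacement noise
        else:
            noised.append(t)                    # keep the clean token
    return noised
```

The replacement branch is what gives the model a non-mask corruption to recover from, matching the abstract's motivation of revising overconfident predictions.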
Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
https://arxiv.org/abs/2511.23071
The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, removing specific text from document images and reconstructing the original image is extremely important for industrial applications. However, most existing text removal methods focus on deleting simple scene text that appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images containing a large amount of text. From the data, we found that text-removal performance is vulnerable to mask-profile perturbations. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.
https://arxiv.org/abs/2511.22499
Video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.
https://arxiv.org/abs/2511.20190
As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answering (STVQA) tasks. For example, OCR errors like misreading "50 mph" as "60 mph" could cause severe traffic accidents. This leads us to ask: Can VLMs know when they can't see? Existing abstention methods suggest pessimistic answers: they either rely on miscalibrated output probabilities or require semantic agreement unsuitable for OCR tasks. However, this failure may indicate we are looking in the wrong place: uncertainty signals could be hidden in VLMs' internal representations. Building on this insight, we propose Latent Representation Probing (LRP): training lightweight probes on hidden states or attention patterns. We explore three probe designs: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single-layer probes by majority vote. Experiments on four benchmarks across image and video modalities show LRP improves abstention accuracy by 7.6% over the best baselines. Our analysis reveals that probes generalize across various uncertainty sources and datasets, and that optimal signals emerge from intermediate rather than final layers. This establishes a principled framework for building deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs.
https://arxiv.org/abs/2511.19806
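Two of the three probe designs named above can be sketched in a few lines: building the concatenated all-layer feature vector, and ensembling single-layer probes by majority vote. The mean-pooling over tokens and the abstention threshold are assumptions for illustration; probe training itself is omitted:

```python
import numpy as np

def concat_layer_features(hidden_states):
    """Probe design 1: concatenate the (mean-pooled over tokens) hidden
    states of all layers into one feature vector for a single probe."""
    return np.concatenate([h.mean(axis=0) for h in hidden_states])

def ensemble_abstain(per_layer_scores, threshold=0.5):
    """Probe design 3: each single-layer probe votes 'abstain' when its
    uncertainty score exceeds the threshold; majority vote decides."""
    votes = [s > threshold for s in per_layer_scores]
    return sum(votes) > len(votes) / 2

# Three layers, 4 tokens each, hidden dim 2 (illustrative shapes)
layers = [np.ones((4, 2)) * i for i in range(3)]
feat = concat_layer_features(layers)          # shape (6,)
abstain = ensemble_abstain([0.2, 0.8, 0.9])   # two of three probes are uncertain
```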
Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from the diversity and randomness of generation, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.
https://arxiv.org/abs/2511.17138
The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on over a million long-video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks like dense video captioning on YouCook2.
https://arxiv.org/abs/2511.14349
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: this https URL
https://arxiv.org/abs/2511.13399
Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
https://arxiv.org/abs/2511.09977
Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.
https://arxiv.org/abs/2511.08133
Motion blur in scene text images severely impairs readability and hinders the reliability of computer vision tasks, including autonomous driving, document digitization, and visual information retrieval. Conventional deblurring approaches are often inadequate in handling spatially varying blur and typically fall short in modeling the long-range dependencies necessary for restoring textual clarity. To overcome these limitations, we introduce a hybrid deep learning framework that combines convolutional neural networks (CNNs) with vision transformers (ViTs), thereby leveraging both local feature extraction and global contextual reasoning. The architecture employs a CNN-based encoder-decoder to preserve structural details, while a transformer module enhances global awareness through self-attention. Training is conducted on a curated dataset derived from TextOCR, where sharp scene-text samples are paired with synthetically blurred versions generated using realistic motion-blur kernels of multiple sizes and orientations. Model optimization is guided by a composite loss that incorporates mean absolute error (MAE), squared error (MSE), perceptual similarity, and structural similarity (SSIM). Quantitative evaluations show that the proposed method attains 32.20 dB in PSNR and 0.934 in SSIM, while remaining lightweight with 2.83 million parameters and an average inference time of 61 ms. These results highlight the effectiveness and computational efficiency of the CNN-ViT hybrid design, establishing its practicality for real-world motion-blurred scene-text restoration.
https://arxiv.org/abs/2511.06087
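The composite loss described above can be sketched in NumPy. The SSIM term here is a simplified global variant (one window over the whole image) and the perceptual term, a pretrained-network feature distance, is omitted for self-containment; the weights are illustrative assumptions:

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified global SSIM: a single window over the whole image."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def composite_loss(pred, target, w=(1.0, 1.0, 1.0)):
    """Weighted sum of MAE, MSE, and a structural term (1 - SSIM).

    The perceptual-similarity term from the abstract is omitted here
    because it requires a pretrained feature extractor.
    """
    mae = np.abs(pred - target).mean()
    mse = ((pred - target) ** 2).mean()
    return w[0]*mae + w[1]*mse + w[2]*(1.0 - ssim_global(pred, target))
```

By construction the loss is zero for a perfect reconstruction and grows with both pixel-wise error and structural mismatch.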
Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, there is a pressing need for an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene text from a given image. We propose a pipeline for automated synthesis of text-VQA datasets that produces faithful QA pairs and scales with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
https://arxiv.org/abs/2511.02046
Current generative super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute UltraZoom-ST (UltraZoom-Scene Text), the first scene text dataset with extreme zoom (×14.29). Extensive experiments show that TIGER achieves state-of-the-art performance, enhancing readability while preserving overall image quality.
https://arxiv.org/abs/2510.21590
With the rapid development of diffusion models, style transfer has made remarkable progress. However, flexible and localized style editing for scene text remains an unsolved challenge. Although existing scene text editing methods have achieved text region editing, they are typically limited to content replacement and simple styles, and lack the ability to perform free-style transfer. In this paper, we introduce SceneTextStylizer, a novel training-free diffusion-based framework for flexible and high-fidelity style transfer of text in scene images. Unlike prior approaches that either perform global style transfer or focus solely on textual content modification, our method enables prompt-guided style transformation specifically for text regions, while preserving both text readability and stylistic consistency. To achieve this, we design a feature injection module that leverages diffusion model inversion and self-attention to transfer style features effectively. Additionally, a region control mechanism is introduced by applying a distance-based changing mask at each denoising step, enabling precise spatial control. To further enhance visual quality, we incorporate a style enhancement module based on the Fourier transform to reinforce stylistic richness. Extensive experiments demonstrate that our method achieves superior performance in scene text style transformation, outperforming existing state-of-the-art methods in both visual fidelity and text preservation.
https://arxiv.org/abs/2510.10910
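One plausible reading of the distance-based changing mask above is a mask around the text region whose influence radius shrinks as denoising proceeds, tightening spatial control at later steps. The bounding-box distance, linear schedule, and radius below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def distance_mask(h, w, bbox, step, total_steps, max_radius=20.0):
    """Binary mask around a text bbox whose radius shrinks over denoising.

    bbox: (x0, y0, x1, y1) in pixel coordinates, inclusive.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    x0, y0, x1, y1 = bbox
    dx = np.maximum(np.maximum(x0 - xs, 0), xs - x1)   # horizontal distance to bbox
    dy = np.maximum(np.maximum(y0 - ys, 0), ys - y1)   # vertical distance to bbox
    dist = np.hypot(dx, dy)
    radius = max_radius * (1.0 - step / total_steps)   # tighter mask at late steps
    return (dist <= radius).astype(float)

early = distance_mask(64, 64, (20, 28, 44, 36), step=0, total_steps=50)
late = distance_mask(64, 64, (20, 28, 44, 36), step=49, total_steps=50)
```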
Current text detection datasets primarily target natural or document scenes, where text typically appears in regular fonts and shapes, monotonous colors, and orderly layouts, and is usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scenes also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: this https URL
https://arxiv.org/abs/2510.07951
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at this https URL .
https://arxiv.org/abs/2510.02787