The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance: word-level bounding boxes are too coarse as prompts for individual characters, while character-level bounding boxes suffer from over-segmentation and under-segmentation issues. In this paper, we propose an automatic annotation pipeline named Char-SAM that turns SAM into a low-cost segmentation annotator with character-level visual prompts. Specifically, leveraging existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module. Next, we employ glyph information corresponding to the text character categories as a new prompt in the Character Glyph Refinement (CGR) module to guide SAM in producing more accurate segmentation masks, addressing over-segmentation and under-segmentation. These modules fully exploit the box-to-mask capability of SAM to generate high-quality text segmentation annotations automatically. Extensive experiments on TextSeg validate the effectiveness of Char-SAM. Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets such as COCO-Text and MLT17.
https://arxiv.org/abs/2412.19917
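As a rough illustration of the box-to-mask capability the Char-SAM pipeline builds on, the sketch below prompts SAM (via the segment-anything package) with one bounding box per character and unions the returned masks. This is not the authors' code; the checkpoint path, image path, and box coordinates are placeholder assumptions, and the CBR/CGR refinement steps are omitted.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and prepare the image (checkpoint and image paths are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One XYXY box per character (illustrative coordinates, e.g. from a CBR-like step).
char_boxes = np.array([[120, 40, 150, 90], [152, 40, 180, 90]])

char_masks = []
for box in char_boxes:
    masks, scores, _ = predictor.predict(box=box, multimask_output=True)
    char_masks.append(masks[np.argmax(scores)])   # keep the highest-scoring proposal

word_mask = np.any(np.stack(char_masks), axis=0)  # union of per-character masks
```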
Recent advancements in scene text spotting have focused on end-to-end methodologies that heavily rely on precise location annotations, which are often costly and labor-intensive to procure. In this study, we introduce an innovative approach that leverages only transcription annotations for training text spotting models, substantially reducing the dependency on elaborate annotation processes. Our methodology employs a query-based paradigm that facilitates the learning of implicit location features through the interaction between text queries and image embeddings. These features are later refined during the text recognition phase using an attention activation map. Addressing the challenges associated with training a weakly-supervised model from scratch, we implement a circular curriculum learning strategy to enhance model convergence. Additionally, we introduce a coarse-to-fine cross-attention localization mechanism for more accurate text instance localization. Notably, our framework supports audio-based annotation, which significantly diminishes annotation time and provides an inclusive alternative for individuals with disabilities. Our approach achieves competitive performance on existing benchmarks, demonstrating that high accuracy in text spotting can be attained without extensive location annotations.
https://arxiv.org/abs/2412.19504
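To make the coarse-to-fine localization idea concrete, here is a hedged sketch of reading a coarse box off a text query's cross-attention map by thresholding and taking the bounding box of the activated region. The threshold rule, map size, and function name are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch

def box_from_attention(attn_map: torch.Tensor, keep: float = 0.5):
    """Derive a coarse box from a query-to-image attention map (H, W): keep activations
    above `keep` * max and take the bounding box of the surviving region."""
    mask = attn_map >= keep * attn_map.max()
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()  # x0, y0, x1, y1

attn = torch.rand(48, 160)  # stand-in for a real cross-attention map
print(box_from_attention(attn))
```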
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OpenStreetMap (OSM) data. However, most studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM database entries based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization information. Furthermore, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at this https URL.
https://arxiv.org/abs/2412.17007
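The sketch below shows the retrieval setup in its simplest form: cosine similarities between L2-normalized text and image embeddings, scored with Recall@K. It is not the CrossText2Loc model; the embedding dimension and the assumption that row i of the text matrix pairs with row i of the image matrix are illustrative.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=5):
    """Text-to-image retrieval Recall@K over L2-normalized embeddings,
    assuming row i of text_emb matches row i of image_emb."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T                                  # (N_text, N_image) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]         # indices of the k most similar images
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

print(recall_at_k(np.random.randn(100, 256), np.random.randn(100, 256)))
```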
Indoor scene texture synthesis has garnered significant interest due to its important potential applications in virtual reality, digital media, and creative arts. Existing diffusion model-based research either relies on per-view inpainting techniques, which are plagued by severe cross-view inconsistencies and conspicuous seams, or resorts to optimization-based approaches that entail substantial computational overhead. In this work, we present RoomPainter, a framework that seamlessly integrates efficiency and consistency to achieve high-fidelity texturing of indoor scenes. The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency. Specifically, we introduce Attention-Guided Multi-View Integrated Sampling (MVIS) combined with a neighbor-integrated attention mechanism for zero-shot texture map generation. Using MVIS, we first generate a texture map for the entire room to ensure global consistency, then adopt its variant, Attention-Guided Multi-View Integrated Repaint Sampling (MVRS), to repaint individual instances within the room, thereby further enhancing local consistency. Experiments demonstrate that RoomPainter achieves superior performance for indoor scene texture synthesis in visual quality, global consistency, and generation efficiency.
https://arxiv.org/abs/2412.16778
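A toy sketch of the multi-view integration idea, assuming a precomputed pixel-to-texel mapping (a stand-in for the real camera-to-UV projection, which is not implemented here): each view's predicted colors are accumulated into a shared texture atlas with per-pixel weights and then normalized. The diffusion sampling and attention guidance of MVIS are omitted.

```python
import numpy as np

def integrate_views(view_colors, view_weights, view_to_texel):
    """Accumulate each view's predicted colors into a shared texture atlas, weighted by
    per-pixel visibility/attention, then normalize. view_colors[v]: (H, W, 3);
    view_weights[v]: (H, W); view_to_texel[v]: (H, W) integer texel indices."""
    n_texels = max(int(m.max()) for m in view_to_texel) + 1
    accum = np.zeros((n_texels, 3))
    weight = np.zeros(n_texels)
    for colors, w, texel_idx in zip(view_colors, view_weights, view_to_texel):
        np.add.at(accum, texel_idx.ravel(), colors.reshape(-1, 3) * w.reshape(-1, 1))
        np.add.at(weight, texel_idx.ravel(), w.ravel())
    return accum / np.maximum(weight[:, None], 1e-8)

views = [np.random.rand(4, 4, 3) for _ in range(3)]
weights = [np.ones((4, 4)) for _ in range(3)]
mapping = [np.random.randint(0, 8, size=(4, 4)) for _ in range(3)]
print(integrate_views(views, weights, mapping).shape)
```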
In the field of scene text spotting, previous OCR methods primarily relied on image encoders and pre-trained text information, but they often overlooked the advantages of incorporating human language instructions. To address this gap, we propose InstructOCR, an innovative instruction-based scene text spotting model that leverages human language instructions to enhance the understanding of text within images. Our framework employs both text and image encoders during training and inference, along with instructions meticulously designed based on text attributes. This approach enables the model to interpret text more accurately and flexibly. Extensive experiments demonstrate the effectiveness of our model, and we achieve state-of-the-art results on widely used benchmarks. Furthermore, the proposed framework can be seamlessly applied to scene text VQA tasks. By leveraging instruction strategies during pre-training, the performance on downstream VQA tasks can be significantly improved, with a 2.6% increase on the TextVQA dataset and a 2.1% increase on the ST-VQA dataset. These experimental results provide insights into the benefits of incorporating human language instructions for OCR-related tasks.
https://arxiv.org/abs/2412.15523
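As a small illustration of attribute-conditioned instructions, the sketch below derives a few prompts from a word's transcription and box geometry. The attributes and templates are invented for illustration; the abstract does not specify InstructOCR's actual instruction set.

```python
def build_instructions(word, box):
    """Derive a few attribute-conditioned instructions from a transcription and its box.
    The attributes and templates here are invented for illustration only."""
    x0, y0, x1, y1 = box
    prompts = [f"Find the text with {len(word)} characters."]
    if word.isdigit():
        prompts.append("Find the numeric text in the image.")
    if (x1 - x0) > 2 * (y1 - y0):
        prompts.append("Find the horizontally elongated text instance.")
    return prompts

print(build_instructions("2024", (10, 20, 90, 40)))
```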
The connected component (CC) is a text shape representation that aligns well with human reading intuition. However, CC-based text detection methods have recently faced a developmental bottleneck: their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing. Concretely, we first represent each text instance as multiple ordered text components, and then treat these components as objects in sequential movement. In this way, scene text detection can be innovatively viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder to achieve a CC-based method that dispenses with post-processing entirely. Additionally, we observe an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to quickly and accurately evaluate the localization quality. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of our ERRNet: it consistently achieves state-of-the-art accuracy while maintaining highly competitive inference speed.
https://arxiv.org/abs/2412.14692
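The Polygon Monte-Carlo idea lends itself to a short sketch: estimate the IoU of two polygons by sampling points uniformly in their joint bounding box and counting membership with matplotlib's point-in-polygon test. The sample count and the exact scoring the paper uses for localization quality are assumptions here.

```python
import numpy as np
from matplotlib.path import Path

def mc_polygon_iou(poly_a, poly_b, n_samples=5000, rng=None):
    """Monte-Carlo IoU estimate between two polygons given as lists of (x, y) vertices."""
    rng = rng or np.random.default_rng(0)
    path_a, path_b = Path(poly_a), Path(poly_b)
    all_xy = np.vstack([poly_a, poly_b])
    lo, hi = all_xy.min(axis=0), all_xy.max(axis=0)
    samples = rng.uniform(lo, hi, size=(n_samples, 2))   # uniform points in the joint bbox
    in_a = path_a.contains_points(samples)
    in_b = path_b.contains_points(samples)
    inter, union = np.logical_and(in_a, in_b).sum(), np.logical_or(in_a, in_b).sum()
    return inter / max(union, 1)

# Two overlapping rectangles; the true IoU is 1/3.
print(mc_polygon_iou([(0, 0), (4, 0), (4, 2), (0, 2)], [(2, 0), (6, 0), (6, 2), (2, 2)]))
```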
Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning over textual and visual information in a given video. Inspired by the development of TextVQA in the image domain, existing Video TextVQA approaches leverage a language model (e.g., T5) to process text-rich multiple frames and generate answers auto-regressively. Nevertheless, the spatio-temporal relationships among visual entities (including scene text and objects) are disrupted, and models are susceptible to interference from unrelated information, resulting in irrational reasoning and inaccurate answers. To tackle these challenges, we propose the TEA ("Track thE Answer") method, which better extends the generative TextVQA framework from image to video. TEA recovers the spatio-temporal relationships in a complementary way and incorporates OCR-aware clues to enhance the quality of question reasoning. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. TEA outperforms existing TextVQA methods, video-language pretraining methods, and video large language models by large margins.
https://arxiv.org/abs/2412.12502
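A toy illustration of recovering temporal relations among scene-text entities: greedily link OCR tokens across consecutive frames when the transcription matches and the boxes overlap. The data format, IoU threshold, and greedy matching are illustrative assumptions, not TEA's actual procedure.

```python
def link_ocr_across_frames(frames, iou_thresh=0.5):
    """Greedily link OCR tokens across frames into tracks.
    frames: list of lists of dicts {"text": str, "box": (x0, y0, x1, y1)}."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / max(area(a) + area(b) - inter, 1e-6)

    tracks = [[tok] for tok in frames[0]]
    for frame in frames[1:]:
        for tok in frame:
            for tr in tracks:
                last = tr[-1]
                if last["text"] == tok["text"] and iou(last["box"], tok["box"]) > iou_thresh:
                    tr.append(tok)
                    break
            else:
                tracks.append([tok])   # start a new track if nothing matched
    return tracks
```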
Scene text spotting has attracted considerable research enthusiasm in recent years. Most existing scene text spotters follow the detection-then-recognition paradigm, where the vanilla detection module hardly determines the reading order, leading to recognition failures. After rethinking auto-regressive scene text recognition methods, we find that a well-trained recognizer can implicitly perceive the local semantics of all characters in a complete word or sentence without a character-level detection module. Local semantic knowledge includes not only text content but also spatial information in the right reading order. Motivated by the above analysis, we propose the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the positions and content of characters guided by local semantics. Specifically, two effective modules are proposed in LSGSpotter. On the one hand, we design a Start Point Localization Module (SPLM) for locating text start points to determine the right reading order. On the other hand, a Multi-scale Adaptive Attention Module (MAAM) is proposed to adaptively aggregate text features in a local area. In conclusion, LSGSpotter accomplishes spotting in arbitrary reading orders without the limitations of sophisticated detection, while alleviating the cost of computational resources through a grid sampling strategy. Extensive experimental results show LSGSpotter achieves state-of-the-art performance on the InverseText benchmark. Moreover, our spotter demonstrates superior performance on English benchmarks for arbitrary-shaped text, achieving improvements of 0.7% and 2.5% on Total-Text and SCUT-CTW1500, respectively. These results validate that our text spotter is effective for scene text in arbitrary reading orders and shapes.
https://arxiv.org/abs/2412.10159
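The grid sampling strategy mentioned above can be sketched with torch.nn.functional.grid_sample: crop a small feature patch around a predicted character center. Patch size, normalization convention, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_local_patch(feat, center_xy, size=7):
    """Crop a size x size feature patch around a predicted character center via grid
    sampling. feat: (1, C, H, W); center_xy: (x, y) in feature-map pixels."""
    _, _, H, W = feat.shape
    offs = torch.linspace(-(size // 2), size // 2, size)
    dy, dx = torch.meshgrid(offs, offs, indexing="ij")
    gx = (center_xy[0] + dx) / (W - 1) * 2 - 1            # normalize to [-1, 1]
    gy = (center_xy[1] + dy) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)      # (1, size, size, 2), (x, y) order
    return F.grid_sample(feat, grid, align_corners=True)   # (1, C, size, size)

local_feat = sample_local_patch(torch.randn(1, 256, 32, 100), center_xy=(40.0, 16.0))
print(local_feat.shape)
```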
Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent these models handle literacy tasks involving rich structure and fine-grained visual challenges. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs, and existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that covers a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. Furthermore, we evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation text, and repetition hallucination. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, driving advancement in LMMs.
https://arxiv.org/abs/2412.02210
Scene text recognition (STR) suffers either from less realistic synthetic training data or from the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text at scale. To tackle this, we introduce TextSSR, a novel framework for Synthesizing Scene text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information, since such a text region is less complex than the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, thanks to its prompt-free nature and capability for character-level synthesis, TextSSR enjoys excellent scalability, and we construct an anagram-based TextSSR-F dataset with 0.4 million complex and realistic text instances. Experiments show that models trained with added TextSSR-F data achieve better accuracy than models trained on 4 million existing synthetic samples. Moreover, the accuracy gap to models trained entirely on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential for scene text image synthesis. Our code is available at this https URL.
https://arxiv.org/abs/2412.01137
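To show what a glyph/position conditioning signal might look like, the sketch below renders the target text into a blank region-sized canvas with Pillow. The default font, canvas size, and text offset are placeholders; the actual conditioning used by TextSSR may differ.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_condition(text, region_size=(128, 32)):
    """Render a plain glyph image for a target text region as a conditioning signal.
    Font, canvas size, and offset are placeholders for illustration."""
    canvas = Image.new("L", region_size, color=0)
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    draw.text((4, 8), text, fill=255, font=font)
    return canvas

render_glyph_condition("SALE").save("glyph_condition.png")
```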
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which contains only a visual model and a CTC-aligned linear classifier and therefore offers fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at this https URL.
https://arxiv.org/abs/2411.15858
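A hedged sketch of the multi-size resizing (MSR) idea: pick, from a small set of candidate shapes, the one whose aspect ratio is closest to the input text image and resize into it. The bucket list is an assumption; the paper's actual shapes and selection rule may differ.

```python
from PIL import Image

# Candidate (height, width) buckets; these values are illustrative only.
MSR_SHAPES = [(64, 64), (48, 96), (40, 128), (32, 256)]

def multi_size_resize(img: Image.Image):
    """Resize a text image into the bucket whose aspect ratio best matches it,
    roughly preserving readability instead of forcing a single fixed shape."""
    w, h = img.size
    ratio = w / h
    tgt_h, tgt_w = min(MSR_SHAPES, key=lambda s: abs(s[1] / s[0] - ratio))
    return img.resize((tgt_w, tgt_h), Image.BILINEAR)

resized = multi_size_resize(Image.new("RGB", (300, 40)))
print(resized.size)
```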
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially artistic and severely distorted characters. The limitation lies in insufficient exploration of character morphologies, including the monotony of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and to generalize to complex samples even when trained with only simple synthetic data. To boost the summarizing process, we theoretically demonstrate a derivation error in the previous character contrastive loss, which mistakenly causes sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters across all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experimental results show that our method achieves SOTA performance (94.7% average accuracy on common benchmarks and 70.9% on Union14M-Benchmark). Code will be available at this https URL.
https://arxiv.org/abs/2411.15585
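One possible reading of the Character Unidirectional Alignment Loss, sketched below: each student character feature is pulled toward the detached teacher reference feature of the same character, so no gradient flows back to the teacher. Feature shapes and the cosine formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def unidirectional_alignment_loss(student_chars, teacher_chars):
    """Pull student character features toward detached teacher references (one-way
    alignment). Shapes: (N, D), where row i of both tensors is the same character."""
    s = F.normalize(student_chars, dim=-1)
    t = F.normalize(teacher_chars.detach(), dim=-1)   # teacher acts as a fixed reference
    return (1 - (s * t).sum(dim=-1)).mean()           # 1 - cosine similarity

loss = unidirectional_alignment_loss(torch.randn(32, 512, requires_grad=True),
                                     torch.randn(32, 512))
loss.backward()
```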
Context-aware methods have achieved remarkable advancements in supervised scene text recognition by leveraging semantic priors from words. Considering the heterogeneity of text and background in STR, we propose that such contextual priors can be reinterpreted as relations between textual elements, serving as effective self-supervised labels for representation learning. However, textual relations are restricted to the finite size of the dataset due to lexical dependencies, which causes an over-fitting problem and thus compromises representation quality. To address this, our work introduces a unified framework of Relational Contrastive Learning and Masked Image Modeling for STR (RCMSTR), which explicitly models enriched textual relations. For the RCL branch, we first introduce a relational rearrangement module to cultivate new relations on the fly. Based on this, we further conduct relational contrastive learning to model the intra- and inter-hierarchical relations for frames, sub-words, and words. On the other hand, MIM can naturally boost contextual information via masking, where we find that a block masking strategy is more effective for STR. For the effective integration of RCL and MIM, we also introduce a novel decoupling design aimed at mitigating the impact of masked images on contrastive learning. Additionally, to enhance the compatibility of MIM with CNNs, we propose adopting sparse convolutions that directly share weights with dense convolutions during training. The proposed RCMSTR demonstrates superior performance in various evaluation protocols for different STR-related downstream tasks, outperforming existing state-of-the-art self-supervised STR techniques. Ablation studies and qualitative experimental results further validate the effectiveness of our method. The code and pre-trained models will be available at this https URL.
https://arxiv.org/abs/2411.11219
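The block masking strategy can be sketched as dropping contiguous blocks of patches until a target ratio is reached, as below. Block size, ratio, and the stopping rule are illustrative; the paper's parameterization may differ.

```python
import numpy as np

def block_mask(h_patches, w_patches, mask_ratio=0.5, block=2, rng=None):
    """Mask whole block x block groups of patches until roughly mask_ratio of the
    patch grid is masked; returns a boolean mask over the grid."""
    rng = rng or np.random.default_rng(0)
    mask = np.zeros((h_patches, w_patches), dtype=bool)
    target = mask_ratio * h_patches * w_patches
    while mask.sum() < target:
        y = rng.integers(0, max(h_patches - block + 1, 1))
        x = rng.integers(0, max(w_patches - block + 1, 1))
        mask[y:y + block, x:x + block] = True
    return mask

print(block_mask(8, 32).astype(int))
```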
Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both the visual understanding of the image and the semantic understanding of the question, demonstrating reasoning capability. Since the inception of this field, a plethora of VQA datasets and models have been published. In this article, we meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category. We divide VQA datasets into four categories: (1) available datasets that contain a rich collection of authentic images, (2) synthetic datasets that contain only synthetic images produced through artificial means, (3) diagnostic datasets that are specially designed to test model performance in a particular area, e.g., understanding the scene text, and (4) KB (Knowledge-Based) datasets that are designed to measure a model's ability to utilize outside knowledge. Concurrently, we explore six main paradigms of VQA models: fusion, where we discuss different methods of fusing information between visual and textual modalities; attention, the technique of using information from one modality to filter information from another; external knowledge base, where we discuss different models utilizing outside information; composition or reasoning, where we analyze techniques to answer advanced questions that require complex reasoning steps; explanation, which is the process of generating visual and textual descriptions to verify sound reasoning; and graph models, which encode and manipulate relationships through nodes in a graph. We also discuss some miscellaneous topics, such as scene text understanding, counting, and bias reduction.
https://arxiv.org/abs/2411.11150
The task of partial scene text retrieval involves localizing and searching for text instances that are the same or similar to a given query text from an image gallery. However, existing methods can only handle text-line instances, leaving the problem of searching for partial patches within these text-line instances unsolved due to a lack of patch annotations in the training data. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our proposed approach adopts a Multiple Instance Learning (MIL) approach to learn their similarities with query text, without requiring extra annotations. However, constructing bags, which is a standard step of conventional MIL approaches, can introduce numerous noisy samples for training, and lower inference speed. To address this issue, we propose a Ranking MIL (RankMIL) approach to adaptively filter those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that can directly search for the target partial patch from a text-line instance during the inference stage, without requiring bags. This greatly improves the search efficiency and the performance of retrieving partial patches. The source code and dataset are available at this https URL.
https://arxiv.org/abs/2411.10261
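A simplified sketch of searching a text line for the best-matching partial patch: slide a window of the query's length over the line's per-character features and score each window by mean cosine similarity. The real Dynamic Partial Match Algorithm is not specified here, so this scoring is an assumption.

```python
import numpy as np

def best_partial_match(query_feats, line_feats):
    """Find the window of a text line's per-character features that best matches a
    query's per-character features under mean cosine similarity.
    query_feats: (M, D); line_feats: (N, D) with N >= M."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    l = line_feats / np.linalg.norm(line_feats, axis=1, keepdims=True)
    n, m = len(l), len(q)
    best_score, best_start = -1.0, 0
    for s in range(n - m + 1):
        score = float(np.mean(np.sum(q * l[s:s + m], axis=1)))
        if score > best_score:
            best_score, best_start = score, s
    return best_start, best_score

print(best_partial_match(np.random.randn(3, 64), np.random.randn(12, 64)))
```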
Text in intelligent transportation scenes carries a large amount of information, and fully harnessing this information is one of the critical drivers for advancing intelligent transportation. Unlike the general scene, detecting text in transportation imposes extra demands, such as fast inference speed, in addition to high accuracy. Most existing real-time text detection methods are based on the shrink mask, which loses some geometric semantic information and requires complex post-processing. In addition, previous methods usually focus on correct output while ignoring feature correction and lacking guidance during the intermediate process. To this end, we propose an efficient multi-scene text detector that contains an effective text representation, the similar mask (SM), and a feature correction module (FCM). Unlike previous methods, the former aims to preserve the geometric information of instances as much as possible; its post-processing saves 50% of the time while accurately and efficiently reconstructing text contours. The latter encourages false-positive features to move away from the positive feature center, optimizing predictions at the feature level. Ablation studies demonstrate the efficiency of the SM and the effectiveness of the FCM. Moreover, the deficiencies of existing traffic datasets (such as low-quality annotations or the unavailability of closed-source data) motivated us to collect and annotate a traffic text dataset, which introduces motion blur. In addition, to validate the scene robustness of SM-Net, we conduct experiments on traffic, industrial, and natural scene datasets. Extensive experiments verify that it achieves state-of-the-art (SOTA) performance on several benchmarks. The code and dataset are available at this https URL.
https://arxiv.org/abs/2411.02794
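One way to read the feature-correction idea is sketched below: compute the center of positive (text) features and penalize background features that lie too close to it. The margin, the cosine formulation, and the label convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def feature_correction_loss(features, labels, margin=0.2):
    """Penalize negative (false-positive) features whose cosine similarity to the
    positive feature center exceeds 1 - margin.
    features: (N, D); labels: (N,) with 1 = text, 0 = background."""
    feats = F.normalize(features, dim=-1)
    pos_center = F.normalize(feats[labels == 1].mean(dim=0), dim=-1)
    neg_sim = feats[labels == 0] @ pos_center          # cosine similarity to the center
    return F.relu(neg_sim - (1 - margin)).mean()        # only near-center negatives are penalized

loss = feature_correction_loss(torch.randn(64, 128, requires_grad=True),
                               torch.randint(0, 2, (64,)))
loss.backward()
```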
In this paper, we propose TextDestroyer, the first training- and annotation-free method for scene text destruction using a pre-trained diffusion model. Existing scene text removal models require complex annotation and retraining, and may leave faint yet recognizable text information, compromising privacy protection and content concealment. TextDestroyer addresses these issues by employing a three-stage hierarchical process to obtain accurate text masks. Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction. During the diffusion denoising process, self-attention key and value are referenced from the original latent to restore the compromised background. Latent codes saved at each inversion step are used for replacement during reconstruction, ensuring perfect background restoration. The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
https://arxiv.org/abs/2411.00355
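A minimal sketch of the scrambling step described above: replace latent values inside the text mask with Gaussian noise matched to the latent's statistics. The strength parameter and statistics matching are illustrative; the attention-reference and latent-replacement stages are omitted.

```python
import torch

def scramble_text_latent(latent, text_mask, strength=1.0):
    """Scramble the text regions of a latent start code with Gaussian noise.
    latent: (1, C, h, w); text_mask: (1, 1, h, w) with 1 inside text regions."""
    noise = torch.randn_like(latent) * latent.std() + latent.mean()
    scrambled = strength * noise + (1 - strength) * latent
    return latent * (1 - text_mask) + scrambled * text_mask

out = scramble_text_latent(torch.randn(1, 4, 64, 64), torch.zeros(1, 1, 64, 64))
print(out.shape)
```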
To prevent unauthorized use of text in images, Scene Text Removal (STR) has become a crucial task. It focuses on automatically removing text and replacing it with a natural, text-less background while preserving significant details such as texture, color, and contrast. Despite its importance in privacy protection, STR faces several challenges, including boundary artifacts, inconsistent texture and color, and preserving correct shadows. Most STR approaches estimate a text region mask to train a model, solving for image translation or inpainting to generate a text-free image. Thus, the quality of the generated image depends on the accuracy of the inpainting mask and the generator's capability. In this work, we leverage the superior capabilities of diffusion models in generating high-quality, consistent images to address the STR problem. We introduce a ControlNet diffusion model, treating STR as an inpainting task. To enhance the model's robustness, we develop a mask pretraining pipeline to condition our diffusion model. This involves training a masked autoencoder (MAE) using a combination of box masks and coarse stroke masks, and fine-tuning it using masks derived from our novel segmentation-based mask refinement framework. This framework iteratively refines an initial mask and segments it using the SLIC and Hierarchical Feature Selection (HFS) algorithms to produce an accurate final text mask. This improves mask prediction and utilizes rich textural information in natural scene images to provide accurate inpainting masks. Experiments on the SCUT-EnsText and SCUT-Syn datasets demonstrate that our method significantly outperforms existing state-of-the-art techniques.
https://arxiv.org/abs/2410.21721
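The superpixel-based refinement can be sketched with scikit-image's SLIC: keep every superpixel whose overlap with the coarse text mask exceeds a threshold. Segment count and threshold are assumptions, and the HFS step and the iterative loop are omitted.

```python
import numpy as np
from skimage.segmentation import slic

def refine_mask_with_slic(image, init_mask, n_segments=400, overlap=0.5):
    """Refine a coarse text mask by keeping SLIC superpixels that overlap it enough.
    image: (H, W, 3) uint8 or float; init_mask: (H, W) bool."""
    segments = slic(image, n_segments=n_segments, start_label=0)
    refined = np.zeros_like(init_mask)
    for sp in np.unique(segments):
        region = segments == sp
        if init_mask[region].mean() > overlap:   # fraction of the superpixel inside the coarse mask
            refined |= region
    return refined
```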
Developing effective scene text detection and recognition models hinges on extensive training data, which can be both laborious and costly to obtain, especially for low-resourced languages. Conventional methods tailored for Latin characters often falter with non-Latin scripts due to challenges like character stacking, diacritics, and variable character widths without clear word boundaries. In this paper, we introduce the first Khmer scene-text dataset, featuring 1,544 expert-annotated images, including 997 indoor and 547 outdoor scenes. This diverse dataset includes flat text, raised text, poorly illuminated text, distant and partially obscured text. Annotations provide line-level text and polygonal bounding box coordinates for each scene. The benchmark includes baseline models for scene-text detection and recognition tasks, providing a robust starting point for future research endeavors. The KhmerST dataset is publicly accessible at this https URL.
https://arxiv.org/abs/2410.18277
Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-language and vision-centric tasks based on auto-regressive modeling. However, these models typically focus on either vision-centric tasks, such as visual grounding and region description, or vision-language tasks, like image caption and multi-scenario VQAs. None of the LMMs have yet comprehensively unified both types of tasks within a single model, as seen in Large Language Models in the natural language processing field. Furthermore, even with abundant multi-task instruction-following data, directly stacking these data for universal capabilities extension remains challenging. To address these issues, we introduce a novel multi-dimension curated and consolidated multimodal dataset, named CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks through multi-level data curation and multi-task consolidation. More importantly, we present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm. Griffon-G resolves the training collapse issue encountered during the joint optimization of these tasks, achieving better training efficiency. Evaluations across multimodal benchmarks, general Visual Question Answering (VQA) tasks, scene text-centric VQA tasks, document-related VQA tasks, Referring Expression Comprehension, and object detection demonstrate that Griffon-G surpasses the advanced LMMs and achieves expert-level performance in complicated vision-centric tasks.
https://arxiv.org/abs/2410.16163