Text2Motion aims to generate human motions from texts. Existing datasets rely on the assumption that texts include action labels (such as "walk, bend, and pick up"), which is not flexible for practical scenarios. This paper redefines the problem under a more realistic assumption that the texts are arbitrary. Specifically, arbitrary texts include existing action texts composed of action labels (e.g., "A person walks and bends to pick up something") and newly introduced scene texts without explicit action labels (e.g., "A person notices his wallet on the ground ahead"). To bridge the gap between this realistic setting and existing datasets, we expand the action texts in the HumanML3D dataset with scene texts, creating a new HumanML3D++ dataset containing arbitrary texts. On this challenging dataset, we benchmark existing state-of-the-art methods and propose a novel two-stage framework that first extracts action labels from arbitrary texts with a Large Language Model (LLM) and then generates motions from the action labels. Extensive experiments are conducted under different application scenarios to validate the effectiveness of the proposed framework on the existing and proposed datasets. The results indicate that Text2Motion in this realistic setting is very challenging, fostering new research in this practical direction. Our dataset and code will be released.
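To make the two-stage idea concrete, here is a minimal Python sketch: stage one asks an LLM to map an arbitrary (possibly scene-level) text to explicit action labels, and stage two hands those labels to any label-conditioned motion generator. The prompt wording and the `toy_llm` / `toy_motion_model` stand-ins are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def extract_action_labels(text: str, llm: Callable[[str], str]) -> List[str]:
    """Stage 1: ask an LLM to turn an arbitrary (possibly scene-level) text
    into a comma-separated list of explicit action labels."""
    prompt = (
        "List the body actions a person would perform in this situation, "
        "as a comma-separated list of short action labels.\n"
        f"Situation: {text}\nActions:"
    )
    return [a.strip() for a in llm(prompt).split(",") if a.strip()]

def generate_motion(action_labels: List[str], motion_model: Callable[[str], object]):
    """Stage 2: feed the recovered action labels to any label-conditioned
    text-to-motion generator (stubbed here)."""
    return motion_model(", ".join(action_labels))

# Toy stand-ins so the sketch runs without external services.
toy_llm = lambda prompt: "walk, bend, pick up"        # pretend LLM output
toy_motion_model = lambda labels: f"<motion for: {labels}>"

scene_text = "A person notices his wallet on the ground ahead."
labels = extract_action_labels(scene_text, toy_llm)
print(labels)                               # ['walk', 'bend', 'pick up']
print(generate_motion(labels, toy_motion_model))
```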
https://arxiv.org/abs/2404.14745
Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images with low-light image enhancement methods before text extraction. However, previous enhancement methods rarely address the importance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module that focuses on scene text regions during enhancement. The proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled the texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at this https URL.
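As a rough illustration of how text-aware enhancement losses can be combined, the sketch below mixes an L1 reconstruction term, a Sobel-based edge reconstruction term, and a text-region-weighted term. The Sobel operator, the mask source, and the loss weights are assumptions for illustration; the paper's exact text detection and edge reconstruction losses may differ.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Edge map via fixed Sobel filters (grayscale input, shape B x 1 x H x W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def enhancement_loss(pred, target, text_mask, w_edge=0.1, w_text=1.0):
    """Reconstruction + edge-reconstruction + text-region-weighted terms.
    `text_mask` (B x 1 x H x W, values in [0,1]) marks scene-text regions,
    e.g. from a detector; the weights are illustrative."""
    rec = F.l1_loss(pred, target)
    edge = F.l1_loss(sobel_edges(pred.mean(1, keepdim=True)),
                     sobel_edges(target.mean(1, keepdim=True)))
    text = (text_mask * (pred - target).abs()).mean()
    return rec + w_edge * edge + w_text * text

pred = torch.rand(2, 3, 64, 64, requires_grad=True)
target, mask = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
print(enhancement_loss(pred, target, mask))
```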
https://arxiv.org/abs/2404.14135
In this paper, we present a method for enhancing the accuracy of scene text recognition by judging whether an image and a text match each other. While previous studies focused on generating recognition results from input images, our approach also considers the model's misrecognition results to understand its error tendencies, thus improving the text recognition pipeline. This method boosts text recognition accuracy by providing explicit feedback on the data that the model is likely to misrecognize, predicting whether a given image and text match. Experimental results on publicly available datasets demonstrate that our proposed method outperforms the baseline and state-of-the-art methods in scene text recognition.
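A minimal sketch of using a separately trained image/text matching score as explicit feedback to rerank recognition hypotheses; the `matcher` callable, the weighting `alpha`, and the toy candidates are hypothetical, and the paper's actual correct/incorrect predictor may be integrated differently.

```python
from typing import Callable, List, Tuple

def rerank_with_matcher(
    image,
    candidates: List[Tuple[str, float]],          # (text, recognizer score)
    matcher: Callable[[object, str], float],      # score of image/text matching
    alpha: float = 0.5,
) -> str:
    """Combine the recognizer's own confidence with a separately trained
    image/text match score and return the best hypothesis."""
    scored = [(alpha * rec + (1 - alpha) * matcher(image, txt), txt)
              for txt, rec in candidates]
    return max(scored)[1]

# Toy stand-ins so the sketch runs.
toy_matcher = lambda img, txt: 0.9 if txt == "coffee" else 0.2
hyps = [("coffee", 0.55), ("coffec", 0.60)]   # recognizer slightly prefers the typo
print(rerank_with_matcher(None, hyps, toy_matcher))   # -> "coffee"
```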
https://arxiv.org/abs/2404.05967
This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. In experimental evaluations on the VinText dataset, the proposed method achieves a significant improvement in accuracy over existing methods, with an impressive gain of 5%. These results demonstrate the efficacy of ensemble learning for Vietnamese scene text spotting in urban environments, highlighting its potential for real-world applications such as text detection and recognition in urban signage, advertisements, and various text-rich urban scenes.
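One generic way to ensemble text spotters is to pool detections from all models, group them by IoU, and majority-vote the transcription within each group, as sketched below; the grouping rule, thresholds, and voting scheme are illustrative assumptions rather than the paper's exact combination strategy.

```python
from collections import Counter

def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def ensemble_spot(predictions, iou_thr=0.5, min_votes=2):
    """predictions: list over models, each a list of (box, text).
    Group detections across models by IoU and majority-vote the transcription."""
    flat = [p for model in predictions for p in model]
    used, results = [False] * len(flat), []
    for i, (box_i, _) in enumerate(flat):
        if used[i]:
            continue
        group = [j for j, (box_j, _) in enumerate(flat)
                 if not used[j] and iou(box_i, box_j) >= iou_thr]
        for j in group:
            used[j] = True
        if len(group) >= min_votes:
            text = Counter(flat[j][1] for j in group).most_common(1)[0][0]
            results.append((box_i, text))
    return results

preds = [
    [((10, 10, 50, 30), "PHO")],
    [((11, 9, 52, 31), "PHỞ")],
    [((12, 11, 49, 29), "PHỞ")],
]
print(ensemble_spot(preds))   # -> [((10, 10, 50, 30), 'PHỞ')]
```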
https://arxiv.org/abs/2404.00852
Scene text image super-resolution has significantly improved the accuracy of scene text recognition. However, many existing methods emphasize performance over efficiency and ignore the practical need for lightweight solutions in deployment scenarios. To address these issues, we propose an efficient framework called SGENet to facilitate deployment on resource-limited platforms. SGENet contains two branches: a super-resolution branch and a semantic guidance branch. We apply a lightweight pre-trained recognizer as a semantic extractor to enhance the understanding of text information. Meanwhile, we design a visual-semantic alignment module to achieve bidirectional alignment between image features and semantics, resulting in the generation of high-quality prior guidance. We conduct extensive experiments on benchmark datasets, and the proposed SGENet achieves excellent performance with lower computational cost. Code is available at this https URL
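A minimal PyTorch sketch of the two-branch idea: a super-resolution branch whose features are refined by cross-attention over semantic tokens from a lightweight recognizer. The channel sizes, the single attention layer standing in for the visual-semantic alignment module, and the token source are assumptions, not SGENet's actual architecture.

```python
import torch
import torch.nn as nn

class TwoBranchSR(nn.Module):
    """Super-resolution branch guided by semantic features from a lightweight
    (frozen) recognizer; the cross-attention stands in for a learned
    visual-semantic alignment module."""
    def __init__(self, dim=64, scale=2):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
        self.align = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.decode = nn.Sequential(
            nn.Conv2d(dim, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr_img, semantic_tokens):
        # lr_img: B x 3 x H x W, semantic_tokens: B x T x dim (from a recognizer)
        feat = self.encode(lr_img)                       # B x dim x H x W
        b, c, h, w = feat.shape
        q = feat.flatten(2).transpose(1, 2)              # B x (H*W) x dim
        guided, _ = self.align(q, semantic_tokens, semantic_tokens)
        feat = feat + guided.transpose(1, 2).reshape(b, c, h, w)
        return self.decode(feat)                         # B x 3 x 2H x 2W

model = TwoBranchSR()
lr = torch.rand(2, 3, 16, 64)
tokens = torch.rand(2, 26, 64)    # e.g. per-character features from a recognizer
print(model(lr, tokens).shape)    # torch.Size([2, 3, 32, 128])
```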
https://arxiv.org/abs/2403.13330
Road surface reconstruction plays a vital role in autonomous driving systems, enabling road lane perception and high-precision mapping. Recently, neural implicit encoding has achieved remarkable results in scene representation, particularly in the realistic rendering of scene textures. However, it faces challenges in directly representing geometric information for large-scale scenes. To address this, we propose EMIE-MAP, a novel method for large-scale road surface reconstruction based on explicit mesh and implicit encoding. The road geometry is represented using explicit mesh, where each vertex stores implicit encoding representing the color and semantic information. To overcome the difficulty in optimizing road elevation, we introduce a trajectory-based elevation initialization and an elevation residual learning method based on Multi-Layer Perceptron (MLP). Additionally, by employing implicit encoding and multi-camera color MLPs decoding, we achieve separate modeling of scene physical properties and camera characteristics, allowing surround-view reconstruction compatible with different camera models. Our method achieves remarkable road surface reconstruction performance in a variety of real-world challenging scenarios.
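The sketch below illustrates the representation described above under simplifying assumptions: explicit vertices carry a learnable implicit code, elevation is a trajectory-derived initial value plus an MLP residual, and per-camera MLP heads decode color while a shared head decodes semantics. Dimensions, depths, and the class count are placeholders, not EMIE-MAP's actual configuration.

```python
import torch
import torch.nn as nn

class RoadSurface(nn.Module):
    """Explicit mesh vertices carrying a learnable implicit code; elevation is a
    trajectory-based initial value plus an MLP-predicted residual, and a small
    per-camera MLP decodes the code to color (dims/depths are illustrative)."""
    def __init__(self, num_vertices, code_dim=32, num_cameras=4):
        super().__init__()
        self.codes = nn.Parameter(torch.zeros(num_vertices, code_dim))
        self.elev_residual = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
        self.color_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(), nn.Linear(64, 3))
            for _ in range(num_cameras)])
        self.semantic_head = nn.Linear(code_dim, 8)     # e.g. 8 road classes

    def forward(self, xy, elev_init, vertex_ids, cam_id):
        # xy: N x 2 ground coords, elev_init: N (from the ego trajectory)
        z = elev_init + self.elev_residual(xy).squeeze(-1)   # residual learning
        code = self.codes[vertex_ids]
        return z, self.color_heads[cam_id](code), self.semantic_head(code)

surf = RoadSurface(num_vertices=1000)
xy = torch.rand(5, 2)
z, rgb, sem = surf(xy, torch.zeros(5), torch.arange(5), cam_id=0)
print(z.shape, rgb.shape, sem.shape)   # torch.Size([5]) torch.Size([5, 3]) torch.Size([5, 8])
```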
https://arxiv.org/abs/2403.11789
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and the impressive performance of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection, just like human beings?", and if yes, 2) "Is the text block another alternative unit for scene text spotting besides word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM on a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incompletely detected texts. Taking advantage of the fine-tuned language model and the text-block detection paradigm, our scene text spotter achieves superior performance across multiple public benchmarks, as demonstrated by extensive experiments. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, and even Large Language Models (LLMs).
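A minimal sketch of the glimpse-focus pipeline: a coarse block-level detector provides rough boxes and a PLM-based recognizer transcribes each cropped block, including multi-line content. The `block_detector` and `plm_recognizer` callables are stubs; the paper's detector and fine-tuned PLM are not reproduced here.

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]   # x1, y1, x2, y2

def glimpse_focus_spot(image, block_detector: Callable, plm_recognizer: Callable):
    """Block-level spotting pipeline: a coarse detector proposes whole text
    blocks (no word/character boxes), and each crop is handed to a language-
    model-based recognizer that can cope with multi-line or partially cut text."""
    results = []
    for box in block_detector(image):                 # rough positions only
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]
        results.append((box, plm_recognizer(crop)))
    return results

# Toy stand-ins so the sketch runs without trained models.
toy_image = np.zeros((100, 200, 3), dtype=np.uint8)
toy_detector = lambda img: [(10, 10, 190, 60)]                    # one text block
toy_recognizer = lambda crop: "OPEN 24 HOURS\nFREE WIFI"          # multi-line output
print(glimpse_focus_spot(toy_image, toy_detector, toy_recognizer))
```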
https://arxiv.org/abs/2403.10047
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than 20% to nearly 90% on our design image benchmark. Also noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.
https://arxiv.org/abs/2403.09622
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods rely heavily on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance fault-tolerant representations of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Extensive experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
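A generic embedding-space adversarial training step (FGSM-style) applied only to OCR token embeddings, standing in for the AOE idea; the single-step sign attack, the `eps` value, and the toy model are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

def adversarial_step(model, ocr_emb, other_inputs, labels, loss_fn, eps=1e-2):
    """One training step with an FGSM-style perturbation applied only to the
    OCR-token embeddings, so the model learns representations tolerant to
    OCR noise (epsilon and the single-step attack are illustrative choices)."""
    ocr_emb = ocr_emb.detach().requires_grad_(True)
    loss = loss_fn(model(ocr_emb, other_inputs), labels)
    grad, = torch.autograd.grad(loss, ocr_emb, retain_graph=True)
    ocr_adv = (ocr_emb + eps * grad.sign()).detach()      # worst-case OCR noise
    adv_loss = loss_fn(model(ocr_adv, other_inputs), labels)
    return loss + adv_loss                                # optimize both terms

# Toy model: answers depend on OCR embeddings plus a question embedding.
class ToyVQA(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(16, 4)
    def forward(self, ocr_emb, question_emb):
        return self.head(ocr_emb.mean(1) + question_emb)

model, loss_fn = ToyVQA(), nn.CrossEntropyLoss()
ocr = torch.rand(2, 5, 16)          # 5 OCR tokens per sample
q = torch.rand(2, 16)
labels = torch.tensor([0, 3])
total = adversarial_step(model, ocr, q, labels, loss_fn)
total.backward()
print(total.item())
```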
https://arxiv.org/abs/2403.09288
The importance of Scene Text Recognition (STR) in today's increasingly digital world cannot be overstated. Given this significance, data-intensive deep learning approaches that auto-learn feature mappings have primarily driven the development of STR solutions. Several benchmark datasets and a substantial body of work on deep learning models are available for Latin languages to meet this need. For Indian languages, which are syntactically and semantically more complex and are spoken and read by 1.3 billion people, far less work and fewer datasets are available. This paper addresses the lack of a comprehensive dataset for the Indian space by proposing the largest and most comprehensive real dataset, IndicSTR12, and benchmarking STR performance on 12 major Indian languages. A few works have addressed the same issue, but to the best of our knowledge, they focused on a small number of Indian languages. The size and complexity of the proposed dataset are comparable to those of existing Latin counterparts, while its multilingualism will catalyse the development of robust text detection and recognition models. It was created specifically for a group of related languages with different scripts. The dataset contains over 27,000 word images gathered from various natural scenes, with over 1,000 word images for each language. Unlike previous datasets, the images cover a broader range of realistic conditions, including blur, illumination changes, occlusion, non-iconic texts, low resolution, perspective text, etc. Along with the new dataset, we provide high-performing baselines with three models: PARSeq, CRNN, and STARNet.
https://arxiv.org/abs/2403.08007
Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while various out-of-vocabulary (OOV) words appear in real-world applications. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To solve this problem, we first propose a pseudo label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds to simulate real-world applications. Secondly, to reduce noise in the pseudo data, we present a semantic checking mechanism to filter out semantically meaningless data. Thirdly, we introduce a quality-aware margin loss to boost training with pseudo data. Our loss includes a margin-based part to enhance the classification ability and a quality-aware part to penalize low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state-of-the-art on eight datasets and achieves first place in the ICDAR2022 challenge.
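A sketch of one plausible form of a quality-aware margin loss: subtract a margin from the target-class logit and weight each sample's cross-entropy by a quality score. The margin value and the source of the quality score are assumptions, and the paper's loss may be parameterized differently.

```python
import torch
import torch.nn.functional as F

def quality_aware_margin_loss(logits, targets, quality, margin=0.35):
    """Subtract a margin from the target-class logit (forcing a larger decision
    margin) and down-weight low-quality samples; `quality` in [0, 1] could come
    from a recognizer confidence or a pseudo-label checker (illustrative choices)."""
    onehot = F.one_hot(targets, num_classes=logits.size(1)).float()
    margin_logits = logits - margin * onehot
    per_sample = F.cross_entropy(margin_logits, targets, reduction="none")
    return (quality * per_sample).mean()

logits = torch.randn(4, 37, requires_grad=True)   # e.g. 37 character classes
targets = torch.tensor([3, 12, 0, 5])
quality = torch.tensor([1.0, 0.9, 0.3, 0.7])      # pseudo samples get lower weight
loss = quality_aware_margin_loss(logits, targets, quality)
loss.backward()
print(loss.item())
```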
https://arxiv.org/abs/2403.07518
Text detection is frequently used by vision-based mobile robots when they need to interpret texts in their surroundings to perform a given task. For instance, delivery robots in multilingual cities need to be capable of multilingual text detection so that they can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train the models to recognize novel languages. However, collecting and labeling training data for novel languages is cumbersome, and the effort to re-train an existing text detector is considerable. Even worse, such a routine would repeat whenever a novel language appears. This motivates us to propose a new problem setting for tackling the aforementioned challenges in a more efficient way: "We ask for a generalizable multilingual text detection framework to detect and identify both seen and unseen language regions inside scene images, without the requirement of collecting supervised training data for unseen languages or of model re-training." To this end, we propose MENTOR, the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
https://arxiv.org/abs/2403.07286
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; we hypothesize that images may contain redundant tokens, and by using similarity to filter tokens and retain the significant ones, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability and minimize hallucinations. Additionally, TextMonkey can be finetuned to gain the ability to comprehend commands for clicking on screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving increases of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and a score of 561 on OCRBench, surpassing prior open-sourced large multimodal models for document understanding. Code will be released at this https URL.
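A minimal sketch of similarity-based token filtering: score each visual token by its mean cosine similarity to the others and keep the least redundant fraction. The keep ratio and the redundancy criterion are assumptions and may differ from TextMonkey's actual token selection.

```python
import torch

def filter_redundant_tokens(tokens, keep_ratio=0.5):
    """tokens: B x N x D. Score each token by how similar it is to the others
    (high similarity = likely redundant) and keep the least redundant ones."""
    x = torch.nn.functional.normalize(tokens, dim=-1)
    sim = x @ x.transpose(1, 2)                       # B x N x N cosine similarity
    n = tokens.size(1)
    redundancy = (sim.sum(-1) - 1.0) / (n - 1)        # mean similarity to others
    k = max(1, int(n * keep_ratio))
    keep = redundancy.topk(k, largest=False).indices  # least redundant tokens
    keep, _ = keep.sort(dim=-1)                       # preserve original order
    return torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

feats = torch.rand(2, 196, 64)        # e.g. 14x14 visual tokens
print(filter_redundant_tokens(feats).shape)   # torch.Size([2, 98, 64])
```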
https://arxiv.org/abs/2403.04473
In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (we refer to the text in images as OCR-Text to distinguish it from text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers the diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we design a new label generation method specifically for ODM and combine it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at this https URL.
https://arxiv.org/abs/2403.00303
Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI. Existing datasets for developing ZSON algorithms lack consideration of dynamic obstacles, object attribute diversity, and scene texts, and thus exhibit noticeable discrepancies from real-world situations. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) that comprises ten high-fidelity 3D scenes with over 18k tasks, aiming to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. Moreover, unlike existing datasets that only provide collision checking between the agent and static obstacles, we enhance DOZE by integrating the capability to detect collisions between the agent and moving obstacles. This novel functionality enables evaluation of agents' collision-avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches concerning navigation efficiency, safety, and object recognition accuracy. Our dataset can be found at this https URL.
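As a toy illustration of the added capability of checking collisions against moving obstacles, the sketch below tests circle overlap between an agent and obstacles that move linearly over time; the radii and the linear motion model are assumptions, not the dataset's actual simulation.

```python
import math

def collides(agent_xy, agent_r, obstacles, t):
    """Check the agent against obstacles whose positions move linearly in time.
    Each obstacle is (start_xy, velocity_xy, radius); a 2-D circle-overlap test
    stands in for the dataset's agent/moving-obstacle collision detection."""
    ax, ay = agent_xy
    for (ox, oy), (vx, vy), r in obstacles:
        px, py = ox + vx * t, oy + vy * t           # obstacle position at time t
        if math.hypot(ax - px, ay - py) < agent_r + r:
            return True
    return False

walking_human = ((0.0, 0.0), (0.5, 0.0), 0.3)        # starts at origin, 0.5 m/s in x
print(collides((2.0, 0.0), 0.3, [walking_human], t=1.0))   # False (1.5 m apart)
print(collides((2.0, 0.0), 0.3, [walking_human], t=3.0))   # True  (0.5 m apart)
```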
https://arxiv.org/abs/2402.19007
Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.
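A sketch of one way to derive soft text distributions from a corpus: estimate smoothed character-bigram probabilities and use them in place of one-hot targets. The bigram model and add-one smoothing are illustrative assumptions; the paper builds its distributions from a large corpus in its own way.

```python
from collections import Counter, defaultdict

def char_bigram_distributions(corpus_words, alphabet):
    """P(next char | previous char), estimated from a word corpus with add-one
    smoothing; these soft targets can replace one-hot labels during decoding."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        for prev, nxt in zip(word, word[1:]):
            counts[prev][nxt] += 1
    dists = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + len(alphabet)   # add-one smoothing
        dists[prev] = {c: (counts[prev][c] + 1) / total for c in alphabet}
    return dists

alphabet = "abcdefghijklmnopqrstuvwxyz"
dists = char_bigram_distributions(["street", "stop", "store", "state"], alphabet)
# After 's', 't' is far more likely than an unseen character such as 'z':
print(round(dists["s"]["t"], 3), round(dists["s"]["z"], 3))
```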
https://arxiv.org/abs/2402.17134
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training. However, collecting and labeling real text images is expensive and time-consuming, which limits the availability of real data. Therefore, most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models. To alleviate this problem, recent semi-supervised STR methods exploit unlabeled real data by enforcing character-level consistency regularization between weakly and strongly augmented views of the same image. However, these methods neglect word-level consistency, which is crucial for sequence recognition tasks. This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects. Specifically, we devise a shortest path alignment module to align the sequential visual features of different views and minimize their distance. Moreover, we adopt a reinforcement learning framework to optimize the semantic similarity of the predicted strings in the embedding space. We conduct extensive experiments on several standard and challenging STR benchmarks and demonstrate the superiority of our proposed method over existing semi-supervised STR methods.
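A minimal sketch of a shortest-path (dynamic-programming) alignment between the per-frame features of two augmented views; a word-level consistency loss would then minimize this aligned distance. The DTW-style recursion and the length normalization are assumptions about the module's form, not the paper's exact implementation.

```python
import numpy as np

def shortest_path_alignment_cost(a, b):
    """a: Ta x D, b: Tb x D feature sequences from weak/strong views.
    Dynamic programming over a monotone alignment path: the cost of the
    cheapest path, which a consistency loss could then minimize."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # Ta x Tb
    ta, tb = dist.shape
    acc = np.full((ta, tb), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(ta):
        for j in range(tb):
            if i == 0 and j == 0:
                continue
            best_prev = min(acc[i - 1, j] if i > 0 else np.inf,
                            acc[i, j - 1] if j > 0 else np.inf,
                            acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, -1] / (ta + tb)     # length-normalised alignment cost

weak = np.random.rand(12, 64)
strong = np.random.rand(10, 64)        # different length after augmentation
print(shortest_path_alignment_cost(weak, strong))
```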
https://arxiv.org/abs/2402.15806
Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module to incorporate the canonical mask guidance for further feature refinement for text recognition. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM exhibits superiority over the state-of-the-art method by an average performance gain of 4.1% across six more challenging datasets, despite utilizing a smaller model size. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement techniques for robust scene text recognition. The code is available at this https URL.
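A small sketch of rendering a canonical glyph mask from a standard font, the kind of style-free guidance described above; it uses PIL's default font so it runs anywhere, whereas the paper's class-aware masks are presumably rendered per character class from a chosen standard font.

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def canonical_glyph_mask(text, size=(128, 32)):
    """Render `text` in a standard font as a binary glyph mask; such canonical,
    style-free masks can guide feature refinement for the noisy scene image."""
    img = Image.new("L", size, color=0)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()            # stand-in for a chosen standard font
    draw.text((4, 8), text, fill=255, font=font)
    return (np.array(img) > 127).astype(np.uint8)

mask = canonical_glyph_mask("COFFEE")
print(mask.shape, mask.sum())    # (32, 128) and the number of glyph pixels
```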
https://arxiv.org/abs/2402.13643
Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images of personalized visual concepts from a limited number of user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when manipulating scenes defined by textual inputs. To address this, we introduce ComFusion, a novel approach that leverages pretrained models to generate compositions of a few user-provided subject images and predefined text scenes, effectively fusing visual subject instances with text-specified scenes and producing high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior preservation regularization, which leverages composites of the subject class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images to ensure they align effectively with both the instance images and the scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and preserving scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization demonstrate its qualitative and quantitative superiority.
https://arxiv.org/abs/2402.11849
Existing methods for scene text detection can be divided into two paradigms: segmentation-based and anchor-based. While segmentation-based methods are well suited to irregular shapes, they struggle with compact or overlapping layouts. Conversely, anchor-based approaches excel on complex layouts but struggle with irregular shapes. To strengthen their merits and overcome their respective demerits, we propose a Complementary Proposal Network (CPN) that seamlessly integrates semantic and geometric information in parallel for superior performance. The CPN comprises two efficient networks for proposal generation: the Deformable Morphology Semantic Network, which generates semantic proposals using an innovative deformable morphological operator, and the Balanced Region Proposal Network, which produces geometric proposals with pre-defined anchors. To further enhance the complementarity, we introduce an Interleaved Feature Attention module that enables semantic and geometric features to interact deeply before proposal generation. By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches by significant margins at comparable computational cost. Specifically, our approach achieves improvements of 3.6%, 1.3%, and 1.0% on the challenging ICDAR19-ArT, IC15, and MSRA-TD500 benchmarks, respectively. Code for our method will be released.
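As a simplified view of how complementary proposals can be fused at the box level, the sketch below pools semantic and geometric proposals and applies NMS so each branch contributes the boxes the other misses; the paper fuses proposals and features in a learned way, so this is only an illustrative stand-in.

```python
def nms_merge(semantic_props, geometric_props, iou_thr=0.5):
    """Merge proposals from a segmentation-style (semantic) branch and an
    anchor-based (geometric) branch: pool both sets, sort by score, and apply
    NMS so each branch can contribute the boxes the other one misses."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    pooled = sorted(semantic_props + geometric_props, key=lambda p: -p[1])
    kept = []
    for box, score in pooled:
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept

semantic = [((5, 5, 60, 25), 0.90)]            # strong on a curved/irregular word
geometric = [((5, 4, 61, 26), 0.85), ((70, 5, 120, 25), 0.80)]   # strong on dense layout
print(nms_merge(semantic, geometric))
```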
https://arxiv.org/abs/2402.11540