Abstract
Incorporating linguistic knowledge can improve scene text recognition, but it is unclear whether the same holds for scene text spotting, which typically involves both text detection and recognition. This paper proposes a method that uses linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used as the training target in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationships between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. The resulting text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves state-of-the-art pipelines for both scene text spotting and recognition, achieving state-of-the-art results on several benchmarks.
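To make the idea of replacing one-hot targets with corpus-derived text distributions concrete, here is a minimal sketch. It is not the paper's procedure: the character bigram prior, the mixing weight `alpha`, the toy `corpus`, and all function names are hypothetical stand-ins chosen only to show how a per-character target can carry information about neighbouring characters.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def bigram_model(corpus, smoothing=1.0):
    """Estimate P(next_char | prev_char) from a word list (add-k smoothing)."""
    counts = defaultdict(Counter)
    for word in corpus:
        for prev, nxt in zip(word, word[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in ALPHABET:
        total = sum(counts[prev].values()) + smoothing * len(ALPHABET)
        model[prev] = [(counts[prev][c] + smoothing) / total for c in ALPHABET]
    return model

def soft_targets(word, model, alpha=0.9):
    """Build one target distribution per character of `word`.

    Assumption: weight `alpha` stays on the true character and the
    remaining mass follows the corpus prior given the previous character.
    """
    uniform = [1.0 / len(ALPHABET)] * len(ALPHABET)
    targets = []
    for i, ch in enumerate(word):
        one_hot = [1.0 if c == ch else 0.0 for c in ALPHABET]
        # The first character has no left context, so fall back to uniform.
        prior = uniform if i == 0 else model[word[i - 1]]
        targets.append([alpha * o + (1 - alpha) * p
                        for o, p in zip(one_hot, prior)])
    return targets

def cross_entropy(target, predicted):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Toy stand-in for a large text corpus.
corpus = ["stop", "store", "street", "exit", "open"]
model = bigram_model(corpus)
targets = soft_targets("stop", model)
```

Each entry of `targets` still peaks at the ground-truth character, so it can be dropped into an auto-regressive decoder's cross-entropy loss wherever a one-hot vector was used; the difference is that the non-target mass now reflects which characters tend to follow the previous one in the corpus.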
URL
https://arxiv.org/abs/2402.17134