Abstract
Scene text recognition, a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored during semantic mining, which limits performance on irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The position self-enhanced encoder provides character-sequence position encodings to both recognition branches. The visual recognition branch performs recognition from the visual features extracted by a CNN together with these position encodings. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues into text recognition. Experiments demonstrate that CMFN achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
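The abstract describes the cross-modal fusion gate only at a high level. A common formulation of such a gate — a minimal sketch under my own assumptions, not the paper's exact design — blends a visual feature vector and a semantic feature vector through a learned per-dimension sigmoid gate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(v, s, W, b):
    """Hypothetical gated cross-modal fusion (illustrative, not CMFN's exact gate).

    v, s : (d,) visual and semantic feature vectors
    W    : (d, 2d) learned projection, b : (d,) bias
    Returns g * v + (1 - g) * s, a per-dimension convex blend of the two modalities.
    """
    g = sigmoid(W @ np.concatenate([v, s]) + b)  # gate values in (0, 1)
    return g * v + (1.0 - g) * s

# Toy usage with random weights (in practice W, b would be trained end to end).
rng = np.random.default_rng(0)
d = 4
v, s = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
fused = fusion_gate(v, s, W, b)
```

Because the gate is a sigmoid, each fused dimension is a convex combination of the corresponding visual and semantic values, letting the network lean on vision where the language model is unreliable and vice versa.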
URL
https://arxiv.org/abs/2401.10041