Abstract
Images with visual and scene text content are ubiquitous in everyday life. However, current image interpretation systems are mostly limited to using only visual features, neglecting to leverage the scene text content. In this paper we propose to jointly use the scene text and visual channels for robust semantic interpretation of images. We undertake the task of matching advertisement images against their human-generated statements that describe the action the ad prompts and the rationale it provides for taking this action. We extract the scene text and generate semantic and lexical text representations, which are used in the interpretation of the ad image. To deal with irrelevant or erroneous detections of scene text, we use a text attention scheme. We also learn an embedding of the visual channel, i.e., visual features based on detected symbolism and objects, into a semantic embedding space, leveraging the text semantics obtained from scene text. We show how the multi-channel approach, involving visual semantics and scene text, improves upon the current state of the art.
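The abstract describes attention-weighting detected scene-text tokens to suppress irrelevant or erroneous OCR detections, then combining the text and visual channels in a shared semantic space to score candidate statements. The paper does not specify its architecture here, so the following is only a minimal NumPy sketch of that general idea, with all embeddings, the attention query, and the fusion-by-summation step being illustrative assumptions rather than the authors' method:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
d = 8                                   # toy embedding dimension (assumption)
token_embs = rng.normal(size=(5, d))    # embeddings of 5 detected scene-text tokens
visual_emb = rng.normal(size=d)         # visual-channel embedding (objects/symbolism)
statement_emb = rng.normal(size=d)      # embedding of one candidate ad statement

# Text attention: weight each scene-text token by its relevance to the
# visual context, so irrelevant or erroneous detections contribute little.
weights = softmax(token_embs @ visual_emb)
text_emb = weights @ token_embs

# Fuse the two channels (here by summation, purely for illustration)
# and score the candidate statement in the shared semantic space.
joint_emb = text_emb + visual_emb
score = cosine(joint_emb, statement_emb)
```

In the matching task, this score would be computed for each candidate statement and the highest-scoring one selected.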
URL
https://arxiv.org/abs/1905.10622