
Beyond Visual Semantics: Exploring the Role of Scene Text in Image Understanding

2019-05-25 15:53:14
Arka Ujjal Dey, Suman Kumar Ghosh, Ernest Valveny

Abstract

Images with visual and scene text content are ubiquitous in everyday life. However, current image interpretation systems are mostly limited to visual features and neglect the scene text content. In this paper we propose to jointly use the scene text and visual channels for robust semantic interpretation of images. We undertake the task of matching advertisement images against their human-generated statements, which describe the action that the ad prompts and the rationale it provides for taking this action. We extract the scene text and generate semantic and lexical text representations, which are used in the interpretation of the ad image. To deal with irrelevant or erroneous detections of scene text, we use a text attention scheme. We also learn an embedding of the visual channel, i.e., visual features based on detected symbolism and objects, into a semantic embedding space, leveraging text semantics obtained from scene text. We show how the multi-channel approach, involving visual semantics and scene text, improves upon the current state of the art.
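
The abstract does not spell out the text attention scheme or the matching function, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes dot-product attention of a visual feature vector over per-word scene text embeddings (down-weighting words that do not agree with the image), followed by cosine-similarity ranking of candidate statements. The function names `attend` and `rank_statements`, the toy 2-dimensional vectors, and the choice of dot-product scoring are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, word_vectors):
    """Attention-pool word vectors using a query vector.

    Assumption: relevance is scored by a plain dot product between the
    query (e.g. a visual feature) and each detected word's embedding.
    Returns (weights, pooled_vector).
    """
    scores = [sum(q * w for q, w in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    pooled = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(len(query))
    ]
    return weights, pooled


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_statements(image_vec, statement_vecs):
    """Return statement indices sorted from most to least similar."""
    return sorted(
        range(len(statement_vecs)),
        key=lambda i: cosine(image_vec, statement_vecs[i]),
        reverse=True,
    )


# Toy data (hypothetical): one visual query and three detected words,
# the last of which points away from the query, like an OCR error might.
query = [1.0, 0.0]
words = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, pooled = attend(query, words)

# Rank three candidate statement embeddings against the pooled text vector.
ranking = rank_statements(pooled, [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
```

In this toy setup the word aligned with the query receives the largest attention weight and the opposing word the smallest, so noisy detections contribute less to the pooled text representation. A real system would learn the scoring function and the embeddings jointly rather than fix them by hand.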

URL

https://arxiv.org/abs/1905.10622

PDF

https://arxiv.org/pdf/1905.10622.pdf

