Paper Reading AI Learner

Multi-modal reward for visual relationships-based image captioning

2023-03-19 20:52:44
Ali Abedi, Hossein Karshenas, Peyman Adibi

Abstract

Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep features used in many recent image captioning methods, the well-known bottom-up features provide a more detailed representation of the different objects in the image than feature maps extracted directly from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning based on fusing the visual relationship information extracted from an image's scene graph with the spatial feature maps of the image. A multi-modal reward function is then introduced for deep reinforcement learning of the proposed network, using a combination of language and vision similarities in a common embedding space. The results of extensive experimentation on the MSCOCO dataset show the effectiveness of using visual relationships in the proposed captioning method. Moreover, the results clearly indicate that the proposed multi-modal reward in deep reinforcement learning leads to better model optimization, outperforming several state-of-the-art image captioning algorithms while using light and easy-to-extract image features. A detailed experimental study of the components constituting the proposed method is also presented.

Abstract (translated)

由于具备有效的表示学习和基于上下文的内容生成能力,深度神经网络在自动图像描述(image captioning)中取得了令人瞩目的成果。作为近期许多图像描述方法中广泛使用的一类深度特征,著名的 bottom-up 特征与直接从原始图像提取的特征图相比,能够更细致地表示图像中的不同物体。然而,尽管其提取过程昂贵且耗费资源,bottom-up 特征仍缺乏关于这些物体之间关系的高层语义信息,这是其一个重要缺点。为了在描述生成中利用视觉关系,本文提出了一种用于图像描述的深度神经网络架构,将从图像场景图中提取的视觉关系信息与图像的空间特征图进行融合。随后引入一种多模态奖励函数,在共同嵌入空间中结合语言相似性与视觉相似性,用于所提网络的深度强化学习训练。在 MSCOCO 数据集上的大量实验结果表明,在所提出的图像描述方法中使用视觉关系是有效的。此外,实验结果清楚地表明,所提出的深度强化学习多模态奖励能够带来更好的模型优化,在仅使用轻量且易于提取的图像特征的情况下,性能优于多种最先进的图像描述算法。文中还对构成所提方法的各个组成部分进行了详细的实验研究。
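
The abstract does not specify how the language and vision similarities are combined, so the following is only a minimal sketch of what such a multi-modal reward could look like in a self-critical policy-gradient setup: a weighted mix of a CIDEr-style language score and a cosine similarity between caption and image embeddings in a shared space. The function names, the mixing weight `alpha`, and the toy embeddings are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors in the common embedding space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def multimodal_reward(language_score: float,
                      caption_embedding: np.ndarray,
                      image_embedding: np.ndarray,
                      alpha: float = 0.5) -> float:
    """Hypothetical multi-modal reward: mixes a language-similarity score
    (e.g. CIDEr against reference captions) with the vision similarity of the
    generated caption to the image, both measured after projecting caption and
    image into a shared embedding space."""
    vision_score = cosine_similarity(caption_embedding, image_embedding)
    return alpha * language_score + (1.0 - alpha) * vision_score


def advantage(sampled_reward: float, baseline_reward: float) -> float:
    """Self-critical-style advantage: reward of the sampled caption minus the
    reward of the greedily decoded baseline caption (assumed training setup)."""
    return sampled_reward - baseline_reward


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img_emb = rng.normal(size=256)                    # image in the joint space (toy)
    cap_emb = img_emb + 0.1 * rng.normal(size=256)    # caption embedding (toy)
    r = multimodal_reward(language_score=1.1,
                          caption_embedding=cap_emb,
                          image_embedding=img_emb,
                          alpha=0.5)
    print(f"multi-modal reward: {r:.3f}")
```

In this sketch, `alpha` trades off fidelity to the reference captions (language term) against grounding in the image itself (vision term); the paper's actual weighting and similarity measures may differ.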

URL

https://arxiv.org/abs/2303.10766

PDF

https://arxiv.org/pdf/2303.10766.pdf

