Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data

2024-02-12 13:27:22
Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

Abstract

The ability to generate sentiment-controlled feedback in response to multimodal inputs, comprising both text and images, addresses a critical gap in human-computer interaction by enabling systems to provide empathetic, accurate, and engaging responses. This capability has profound applications in healthcare, marketing, and education. To this end, we construct a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The proposed system includes an encoder, decoder, and controllability block for textual and visual inputs. It extracts textual features with a transformer network and visual features with a Faster R-CNN network, then combines them to generate feedback. The CMFeed dataset encompasses images, text, reactions to the post, human comments with relevance scores, and reactions to the comments. The reactions to the post and comments are utilized to train the proposed model to produce feedback with a particular (positive or negative) sentiment. A sentiment classification accuracy of 77.23% has been achieved, 18.82% higher than the accuracy achieved without the controllability block. Moreover, the system incorporates a similarity module for assessing feedback relevance through rank-based metrics. It implements an interpretability technique to analyze the contribution of textual and visual features during the generation of uncontrolled and controlled feedback.
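To make the described architecture concrete, below is a minimal PyTorch sketch of the fusion idea from the abstract: textual features from a transformer encoder, visual features from a Faster R-CNN backbone, and a learned sentiment-control embedding concatenated into the decoder's memory. This is not the authors' code; all module names, layer sizes, the two-way (negative/positive) control vocabulary, and the use of torchvision's detection backbone as a generic visual feature extractor are illustrative assumptions.

```python
# A minimal sketch (assumptions, not the paper's implementation) of
# transformer + Faster R-CNN feature fusion with a sentiment-control token.
import torch
import torch.nn as nn
from torchvision.models.detection import fasterrcnn_resnet50_fpn


class ControllableFeedbackSketch(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256):
        super().__init__()
        # Textual branch: token embeddings + a small transformer encoder
        # (positional encodings omitted for brevity).
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Visual branch: reuse the Faster R-CNN FPN backbone as a feature
        # extractor; it returns a dict of feature maps keyed '0'..'3', 'pool'.
        detector = fasterrcnn_resnet50_fpn(weights=None)
        self.visual_backbone = detector.backbone
        self.visual_proj = nn.Linear(256, d_model)  # FPN channels -> d_model
        # Controllability block: one embedding per target sentiment
        # (0 = negative, 1 = positive), mirroring reaction-derived labels.
        self.sentiment_embed = nn.Embedding(2, d_model)
        # Decoder stub: attends over the fused multimodal memory.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, images, sentiment, tgt_ids):
        text_feats = self.text_encoder(self.embed(token_ids))    # (B, T, d)
        fmaps = self.visual_backbone(images)                     # dict of (B, 256, H, W)
        pooled = fmaps["pool"].flatten(2).transpose(1, 2)        # (B, HW, 256)
        vis_feats = self.visual_proj(pooled)                     # (B, HW, d)
        ctrl = self.sentiment_embed(sentiment).unsqueeze(1)      # (B, 1, d)
        memory = torch.cat([text_feats, vis_feats, ctrl], dim=1) # fused memory
        hidden = self.decoder(self.embed(tgt_ids), memory)
        return self.out(hidden)                                  # (B, T_out, vocab)
```

Under these assumptions, passing `sentiment=torch.ones(batch_size, dtype=torch.long)` would steer decoding toward positive feedback and zeros toward negative, which is one plausible way to realize the abstract's reaction-trained controllability block.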

URL

https://arxiv.org/abs/2402.07640

PDF

https://arxiv.org/pdf/2402.07640.pdf

