Paper Reading AI Learner

HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer

2023-05-21 06:43:35
Yubin Kim, Dong Won Lee, Paul Pu Liang, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park

Abstract

Accurately modeling affect dynamics, the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. By analyzing affect dynamics, we can gain insights into how people communicate, respond to different situations, and form relationships. However, modeling affect dynamics is challenging due to contextual factors such as the complex and nuanced nature of interpersonal relationships, the situation, and other factors that influence affective displays. To address this challenge, we propose a Cross-person Memory Transformer (CPM-T) framework that explicitly models affect dynamics (intrapersonal and interpersonal influences) by identifying verbal and non-verbal cues, and leverages a large language model to draw on pre-trained knowledge and perform verbal reasoning. The CPM-T framework maintains memory modules that store and update context within the conversation window, enabling the model to capture dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information across modalities and cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and generalizability of our approach on three publicly available datasets for joint engagement, rapport, and human-belief prediction tasks. Remarkably, the CPM-T framework outperforms baseline models in average F1-score by up to 7.3%, 9.3%, and 2.0%, respectively. Finally, we demonstrate the importance of each component of the framework via ablation studies on multimodal temporal behavior.
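As an illustration of the cross-person attention idea described above (a minimal sketch, not the paper's actual implementation), one participant's feature sequence can serve as the queries while the other participant's sequence supplies the keys and values, so each of person A's time steps is re-expressed as a convex combination of person B's behavior features. All names and dimensions below are hypothetical:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_person_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from person A
    and keys/values from person B, aligning A's time steps to B's
    behavior sequence. Each row of the output is a convex combination
    of B's value vectors."""
    d = len(queries[0])  # feature dimension
    aligned = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        aligned.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(d)])
    return aligned

# Toy example: one time step for person A, two for person B (2-dim features).
person_a = [[1.0, 0.0]]
person_b = [[1.0, 0.0], [0.0, 1.0]]
out = cross_person_attention(person_a, person_b, person_b)
```

In the toy example, A's step is most similar to B's first step, so the attention weight on that step dominates; the same mechanism with modality-specific sequences (e.g. audio queries over visual keys/values) would give the cross-modal variant.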

Abstract (translated)

Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversation, is crucial for understanding human interaction. By analyzing affect dynamics, we can gain insight into how people communicate, respond to different situations, and form relationships. However, modeling affect dynamics is challenging because of contextual factors, such as the complexity and nuance of interpersonal relationships, the situation, and other factors that influence affective displays. To address this challenge, we propose the Cross-person Memory Transformer (CPM-T) framework, which explicitly models affect dynamics (intrapersonal and interpersonal influences) by identifying verbal and non-verbal cues, and uses a large language model to exploit pre-trained knowledge for verbal reasoning. The CPM-T framework maintains memory modules that store and update context within the conversation window, enabling the model to capture dependencies between earlier and later parts of a conversation. In addition, our framework uses cross-modal attention to effectively align multimodal information and cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and generalizability of our method on three public datasets (joint engagement, rapport, and human-belief prediction tasks). Remarkably, the CPM-T framework outperforms baseline models in average F1-score by 7.3%, 9.3%, and 2.0%. Finally, through ablation studies we demonstrate the importance of each component of the framework from the perspective of multimodal temporal behavior.

URL

https://arxiv.org/abs/2305.12369

PDF

https://arxiv.org/pdf/2305.12369.pdf

