Dyadic Interaction Modeling for Social Behavior Generation

2024-03-14 03:21:33
Minh Tran, Di Chang, Maksim Siniukov, Mohammad Soleymani

Abstract

Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work considers a listener as a reactive agent responding reflexively to the speaker's voice and facial motions. In contrast, the heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through a VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state-of-the-art according to quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capabilities of the proposed approach in generating diverse and realistic expressions, eye blinks, and head gestures.
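For readers who want a concrete picture of the two mechanisms the abstract names, the minimal PyTorch sketch below illustrates (a) a standard VQ-VAE vector quantizer that discretizes continuous facial-motion features into codebook entries, and (b) a DIM-style pre-training objective that masks part of the listener stream, reconstructs it from the joint speaker-listener context, and aligns pooled speaker and listener representations with a contrastive (InfoNCE) loss. This is not the authors' implementation: the dimensions, module names, transformer encoder, and loss weighting are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of: (1) VQ-VAE discretization of
# motion features, and (2) a masking + contrastive pre-training objective
# over paired speaker/listener streams. All shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Standard VQ-VAE quantizer: snap each frame's feature to its nearest codebook entry."""
    def __init__(self, num_codes=256, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                  # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])                  # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # L2 distance to every code
        idx = dists.argmin(dim=-1)                         # nearest code per frame
        z_q = self.codebook(idx).view_as(z)
        # Codebook loss pulls codes toward encoder outputs; commitment loss does the reverse.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                       # straight-through estimator
        return z_q, idx.view(z.shape[:-1]), loss

def dim_pretraining_losses(spk_tokens, lst_tokens, encoder, mask_ratio=0.5, tau=0.07):
    """Illustrative DIM-style objective: mask part of the listener stream,
    reconstruct it from the joint speaker+listener context, and pull pooled
    speaker/listener representations together with InfoNCE."""
    B, T, D = lst_tokens.shape
    mask = torch.rand(B, T, device=lst_tokens.device) < mask_ratio
    lst_masked = lst_tokens.masked_fill(mask.unsqueeze(-1), 0.0)

    joint = encoder(torch.cat([spk_tokens, lst_masked], dim=1))  # (B, 2T, D)
    lst_ctx = joint[:, T:, :]                                    # listener half

    recon_loss = F.mse_loss(lst_ctx[mask], lst_tokens[mask])     # masked reconstruction

    # Contrastive alignment: matched speaker/listener pairs are positives,
    # other pairs in the batch are negatives.
    s = F.normalize(joint[:, :T, :].mean(dim=1), dim=-1)         # (B, D)
    l = F.normalize(lst_ctx.mean(dim=1), dim=-1)                 # (B, D)
    logits = s @ l.t() / tau                                     # (B, B) similarities
    targets = torch.arange(B, device=logits.device)
    nce_loss = F.cross_entropy(logits, targets)
    return recon_loss, nce_loss

if __name__ == "__main__":
    B, T, D = 4, 32, 128
    vq = VectorQuantizer(dim=D)
    enc = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), 2)
    spk, _, vq_loss_s = vq(torch.randn(B, T, D))   # quantized speaker motion
    lst, _, vq_loss_l = vq(torch.randn(B, T, D))   # quantized listener motion
    recon, nce = dim_pretraining_losses(spk, lst, enc)
    print(recon.item(), nce.item(), (vq_loss_s + vq_loss_l).item())
```

The straight-through trick (`z + (z_q - z).detach()`) is what lets gradients flow through the non-differentiable nearest-code lookup; the discrete indices it produces are what make non-deterministic generation possible, since a decoder can later sample code sequences rather than regress a single mean motion.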

URL

https://arxiv.org/abs/2403.09069

PDF

https://arxiv.org/pdf/2403.09069.pdf

