Abstract
Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for generating 3D facial motions in dyadic interactions. Existing work treats the listener as a reactive agent with reflexive behaviors triggered by the speaker's voice and facial motions. At the heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning, yielding representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through a VQ-VAE. The pre-trained model is then fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state of the art on quantitative measures that capture the diversity and realism of generated motions. Qualitative results further demonstrate the capability of the proposed approach to generate diverse and realistic expressions, eye blinks, and head gestures.
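The abstract mentions encoding motions into discrete latent representations with a VQ-VAE. As a rough illustration of the quantization step at the core of that idea, here is a minimal NumPy sketch: each continuous latent vector is snapped to its nearest entry in a learned codebook. The function name, toy codebook, and dimensions are purely illustrative assumptions, not the paper's actual model or codebook configuration.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector in z to its nearest codebook entry.

    z:        (n, d) array of encoder outputs
    codebook: (k, d) array of learned code vectors
    Returns the quantized latents and the chosen code indices.
    """
    # Squared Euclidean distance between every latent and every code vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # index of the nearest code per latent
    return codebook[indices], indices

# Toy usage: 4 latents of dimension 2 against a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.9], [0.05, 0.0]])
quantized, idx = vector_quantize(z, codebook)
# idx → [0, 1, 2, 0]; downstream models then operate on these discrete codes.
```

In a full VQ-VAE the codebook is learned jointly with the encoder and decoder (with a straight-through gradient estimator), and the resulting discrete codes are what allow non-deterministic generation: a generator can sample different code sequences for the same speaker input.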
URL
https://arxiv.org/abs/2403.09069