Paper Reading AI Learner

Co-Speech Gesture Synthesis using Discrete Gesture Token Learning

2023-03-04 01:42:09
Shuhong Lu, Youngwoo Yoon, Andrew Feng

Abstract

Synthesizing realistic co-speech gestures is an important yet unsolved problem for creating believable motions that can drive a humanoid robot to interact and communicate with human users. Such a capability will improve how human users perceive the robots and will find applications in education, training, and medical services. One challenge in learning a co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance. Deterministic regression methods cannot resolve such conflicting samples and may produce over-smoothed or damped motions. We propose a two-stage model to address this uncertainty in gesture synthesis by modeling gesture segments as discrete latent codes. In the first stage, our method uses RQ-VAE to learn a discrete codebook of gesture tokens from training data. In the second stage, a two-level autoregressive transformer learns the prior distribution of residual codes conditioned on the input speech context. Since inference is formulated as token sampling, multiple gesture sequences can be generated for the same speech input using top-k sampling. Quantitative results and a user study show that the proposed method outperforms previous methods and generates realistic and diverse gesture motions.
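The core idea behind the first stage, residual quantization, is to encode a continuous latent vector as a short sequence of code indices, where each level quantizes the residual left over by the previous level. The following is a minimal NumPy sketch of that idea only, not the authors' implementation; the codebook sizes and dimensions are illustrative assumptions.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Encode vector z as one code index per level.

    Each level picks the nearest code to the current residual, then
    passes the remaining residual down to the next level, so deeper
    levels refine the approximation left by shallower ones.
    """
    codes, residual = [], z.astype(float)
    for cb in codebooks:                            # one codebook per depth level
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest code at this level
        codes.append(idx)
        residual = residual - cb[idx]               # remainder for the next level
    return codes, z - residual                      # indices + reconstruction

# Illustrative setup: 3 quantization levels, 8 codes each, 4-dim latents.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
z = rng.normal(size=4)
codes, z_hat = residual_quantize(z, codebooks)
```

At inference time, the second-stage transformer would predict such code indices level by level, and decoding sums the selected code vectors back into a latent gesture segment.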

URL

https://arxiv.org/abs/2303.12822

PDF

https://arxiv.org/pdf/2303.12822.pdf

