Abstract
Synthesizing realistic co-speech gestures is an important yet unsolved problem for creating believable motions that can drive a humanoid robot to interact and communicate with human users. Such a capability will improve users' impressions of robots and find applications in education, training, and medical services. One challenge in learning a co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance. Deterministic regression methods cannot resolve such conflicting samples and may produce over-smoothed or damped motions. We propose a two-stage model that addresses this uncertainty in gesture synthesis by modeling gesture segments as discrete latent codes. In the first stage, our method uses an RQ-VAE to learn a discrete codebook of gesture tokens from training data. In the second stage, a two-level autoregressive transformer learns the prior distribution of the residual codes conditioned on the input speech context. Since inference is formulated as token sampling, multiple gesture sequences can be generated for the same speech input using top-k sampling. Quantitative results and a user study show that the proposed method outperforms previous methods and generates realistic and diverse gesture motions.
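The abstract's key mechanism is residual quantization: each gesture segment is encoded as a stack of discrete codes, where every codebook level quantizes the residual left by the previous one. The sketch below illustrates that mechanics only; the codebook sizes, depth, and dimensions are illustrative assumptions, not values from the paper, and the codebooks are random rather than learned as in a trained RQ-VAE.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize vector z with a stack of codebooks (one per level).

    Returns the list of chosen code indices and the reconstruction,
    i.e. the sum of the selected codewords across levels.
    """
    residual = z.copy()
    codes = []
    recon = np.zeros_like(z)
    for book in codebooks:
        # Nearest codeword to the current residual at this level.
        dists = np.linalg.norm(book - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon += book[idx]
        # The next level only has to explain what this level missed.
        residual -= book[idx]
    return codes, recon

# Illustrative (assumed) hyperparameters: 4 levels, 32 codewords, dim 8.
rng = np.random.default_rng(0)
dim, levels, size = 8, 4, 32
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]

z = rng.normal(size=dim)          # stand-in for one gesture-segment latent
codes, recon = residual_quantize(z, codebooks)
```

In the second stage described above, a prior model would predict the discrete `codes` per level autoregressively from speech features, and sampling those tokens (e.g. with top-k) yields different plausible gesture sequences for the same speech input.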
URL
https://arxiv.org/abs/2303.12822