Abstract
The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which efficiently generates high-quality motion codes. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
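The abstract names conditional flow matching (CFM) as the basis of the motion sampler. As a rough illustration of that general technique only (not the paper's actual model), the sketch below shows the standard CFM training objective with a linear (rectified-flow) probability path and few-step Euler sampling; the network architecture, feature dimensions, and conditioning inputs are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of conditional flow matching for motion-code generation.
# All names, dimensions, and the conditioning scheme are assumptions for
# illustration; the paper's actual architecture is not specified here.

class VelocityNet(nn.Module):
    """Predicts the velocity field v_theta(x_t, t, cond)."""
    def __init__(self, motion_dim=128, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # Time t enters as one extra scalar feature per sample.
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    """One CFM training step on target motion codes x1."""
    x0 = torch.randn_like(x1)            # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)        # t ~ U(0, 1)
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    target_v = x1 - x0                   # constant target velocity
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

@torch.no_grad()
def sample(model, cond, steps=10, motion_dim=128):
    """Few-step Euler integration from noise to a motion code."""
    x = torch.randn(cond.size(0), motion_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i * dt)
        x = x + dt * model(x, t, cond)
    return x
```

The efficiency claim in the abstract corresponds to the small number of integration steps such samplers typically need at inference, in contrast to the long reverse chains of diffusion models.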
URL
https://arxiv.org/abs/2405.10272