Abstract
Automatic dubbing (AD) is the task of translating the original speech in a video into speech in a target language. The new target-language speech should satisfy isochrony; that is, it should be time-aligned with the original video, including mouth movements, pauses, hand gestures, etc. In this paper, we propose training a model that directly optimizes both the translation and the speech duration of the generated translations. We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
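To make the isochrony objective concrete, here is a minimal sketch (not the paper's actual method) of scoring a candidate translation by combining translation likelihood with a simple relative duration-mismatch penalty; all function names, the penalty form, and the weight are illustrative assumptions.

```python
def isochrony_penalty(source_duration_s: float, target_duration_s: float) -> float:
    """Relative mismatch between source speech duration and dubbed speech duration.
    This specific penalty form is an illustrative assumption, not the paper's."""
    return abs(target_duration_s - source_duration_s) / source_duration_s

def combined_score(translation_log_prob: float,
                   source_duration_s: float,
                   target_duration_s: float,
                   duration_weight: float = 2.0) -> float:
    """Higher is better: translation log-likelihood minus a weighted
    duration-mismatch penalty (hypothetical scoring rule)."""
    penalty = isochrony_penalty(source_duration_s, target_duration_s)
    return translation_log_prob - duration_weight * penalty

# Example: two candidate translations for a 3.0-second source utterance.
# Candidate A: slightly worse translation, near-perfect timing.
score_a = combined_score(-5.0, source_duration_s=3.0, target_duration_s=3.1)
# Candidate B: better translation, but 50% too long to dub isochronously.
score_b = combined_score(-4.5, source_duration_s=3.0, target_duration_s=4.5)
```

Under this scoring rule, candidate A outranks candidate B despite its lower translation likelihood, because its duration matches the source speech far more closely.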
URL
https://arxiv.org/abs/2302.12979