Abstract
This paper addresses the challenge of transferring the behavior expressivity style of one virtual agent to another while preserving behavior shapes, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and enables generalization to styles unseen during training. We train our model on the PATS corpus, which we extended with dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art style transfer models for both styles seen and unseen during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behaviors and gestures associated with the target style are successfully transferred, while ensuring the preservation of those related to the source content.
URL
https://arxiv.org/abs/2308.10843