Responsive Listening Head Generation: A Benchmark Dataset and Baseline

2021-12-27 07:18:50

Mohan Zhou, Yalong Bai, Wei Zhang, Tiejun Zhao, Tao Mei

arXiv_CV

arXiv_CV Face Gesture Action Speech Dialog Chat

Abstract

Responsive listening during face-to-face conversations is a critical element of social interaction and is well established in psychological research. By responding in real time with non-verbal signals to the speaker's words, intonation, or behavior, listeners show how engaged they are in the dialogue. In this work, we build the Responsive Listener Dataset (RLD), a conversation video corpus collected from public resources, featuring 67 speakers and 76 listeners with three different attitudes. We define the responsive listening head generation task as the synthesis of a non-verbal head, with motions and expressions that react to multiple inputs, including the audio and visual signals of the speaker. Unlike speech-driven gesture or talking head generation, this task introduces more modalities, which we hope will benefit several research fields, including human-to-human interaction, video-to-video translation, cross-modal understanding, and generation. Furthermore, we release an attitude-conditioned listening head generation baseline. Project page: \url{this https URL}.
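
To make the task definition concrete, the input/output contract can be sketched roughly as follows. This is only an illustrative interface, not the released baseline: the names `SpeakerFrame`, `ListenerFrame`, `Attitude`, and `generate_listener` are hypothetical, the three attitude labels are assumed (the abstract says only that there are three), and the smoothing rule and gain values are arbitrary placeholders standing in for a learned model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class Attitude(Enum):
    # Assumed label names; the abstract only states that there are three attitudes.
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


@dataclass
class SpeakerFrame:
    audio: List[float]   # hypothetical per-frame audio features of the speaker
    visual: List[float]  # hypothetical per-frame head motion/expression features


@dataclass
class ListenerFrame:
    motion: List[float]  # hypothetical listener head motion/expression coefficients


def _mean_abs(values: Sequence[float]) -> float:
    """Mean absolute value, defined as 0.0 for an empty sequence."""
    return sum(abs(v) for v in values) / len(values) if values else 0.0


def generate_listener(
    frames: Sequence[SpeakerFrame], attitude: Attitude, dim: int = 3
) -> List[ListenerFrame]:
    """Placeholder generator: one listener frame per speaker frame.

    Each output depends only on the current and earlier speaker frames,
    mirroring the real-time setting described in the abstract.
    """
    # Arbitrary illustrative gains, not values from the paper.
    gain = {Attitude.POSITIVE: 1.0, Attitude.NEUTRAL: 0.5, Attitude.NEGATIVE: 0.25}[attitude]
    state = [0.0] * dim
    outputs: List[ListenerFrame] = []
    for frame in frames:
        # Combine both speaker modalities into a single driving signal.
        drive = _mean_abs(frame.audio) + _mean_abs(frame.visual)
        # Exponential smoothing stands in for a learned sequence model.
        state = [0.5 * s + 0.5 * gain * drive for s in state]
        outputs.append(ListenerFrame(motion=list(state)))
    return outputs
```

The point of the sketch is the shape of the problem: a sequence of speaker audio and visual features plus an attitude label goes in, and a listener motion sequence of the same length comes out.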

URL

https://arxiv.org/abs/2112.13548

PDF

https://arxiv.org/pdf/2112.13548.pdf