
DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

2025-04-05 01:19:46
Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, Mohammad Soleymani

Abstract

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, which limits both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first uses DiTaiListener-Gen to generate short segments of listener responses conditioned on the speaker's speech and facial motions. It then refines the transitional frames with DiTaiListener-Edit so that consecutive segments join seamlessly. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) to the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) that processes the speaker's auditory and visual cues. CTM-Adapter integrates the speaker's input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a video-to-video diffusion model for transition refinement. It fuses the short video segments produced by DiTaiListener-Gen into smooth, continuous videos, ensuring temporal consistency in facial expressions and image quality. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO). User studies confirm the superior performance of DiTaiListener: the model is the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
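
The abstract says the CTM-Adapter integrates speaker input "in a causal manner" but does not describe the mechanism. One common way to get that property is a causal attention mask, in which listener frame t may only attend to speaker frames at or before t. The sketch below illustrates that general idea in plain Python. It is an assumption made for illustration, not the paper's implementation, and the function names `causal_mask` and `masked_attention_weights` are hypothetical.

```python
import math


def causal_mask(num_listener_frames, num_speaker_frames):
    """mask[t][s] is True when listener frame t may see speaker frame s (s <= t)."""
    return [
        [s <= t for s in range(num_speaker_frames)]
        for t in range(num_listener_frames)
    ]


def masked_attention_weights(scores, mask):
    """Row-wise softmax restricted to allowed positions; masked positions get 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [sc for sc, ok in zip(row_scores, row_mask) if ok]
        if not allowed:
            # No visible speaker frame: return all zeros instead of dividing by 0.
            weights.append([0.0] * len(row_scores))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [
            math.exp(sc - peak) if ok else 0.0
            for sc, ok in zip(row_scores, row_mask)
        ]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: 3 listener frames attending over 3 speaker frames,
# with uniform (all-zero) attention scores.
mask = causal_mask(3, 3)
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_attention_weights(scores, mask)
for row in weights:
    print(row)
```

With uniform scores, each listener frame spreads its attention evenly over the speaker frames it is allowed to see, and every future speaker frame receives zero weight. A real adapter would compute the scores from learned query and key projections of the audio and facial-motion features.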


URL

https://arxiv.org/abs/2504.04010

PDF

https://arxiv.org/pdf/2504.04010.pdf

