Paper Reading AI Learner

KeyframeFace: From Text to Expressive Facial Keyframes

2025-12-12 06:45:02
Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

Abstract

Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods focus mainly on speech-driven animation or unstructured expression sequences, and therefore lack the semantic grounding and temporal structure needed to generate expressive human performances. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations derived from ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding of LLMs with the interpretable structure of ARKit coefficients, enabling high-fidelity expressive animation. Together, KeyframeFace and our LLM-based framework establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at this https URL.
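The abstract describes clips annotated with per-frame ARKit coefficients (ARKit face tracking exposes 52 blendshape values in [0, 1]) plus manually chosen keyframes. A minimal sketch of how such a record and the keyframe-guided idea might look in code — the field names and the `interpolate_keyframes` helper are illustrative assumptions, not the released dataset format:

```python
from dataclasses import dataclass

NUM_ARKIT_COEFFS = 52  # ARKit face tracking provides 52 blendshape coefficients in [0, 1]

@dataclass
class FaceSample:
    """Hypothetical record for one keyframe-annotated clip (field names are assumptions)."""
    script: str                 # natural-language performance script
    coeffs: list[list[float]]   # per-frame ARKit coefficients, shape (T, 52)
    keyframes: list[int]        # manually defined keyframe indices into coeffs

def interpolate_keyframes(sample: FaceSample, t: int) -> list[float]:
    """Linearly interpolate coefficients at frame t from the bracketing keyframes.

    Sketches the keyframe-guided idea: dense motion recovered from sparse,
    semantically grounded keyframes.
    """
    ks = sorted(sample.keyframes)
    if t <= ks[0]:            # clamp before the first keyframe
        return sample.coeffs[ks[0]]
    if t >= ks[-1]:           # clamp after the last keyframe
        return sample.coeffs[ks[-1]]
    for a, b in zip(ks, ks[1:]):
        if a <= t <= b:       # blend the two bracketing keyframes linearly
            w = (t - a) / (b - a)
            return [(1 - w) * x + w * y
                    for x, y in zip(sample.coeffs[a], sample.coeffs[b])]

sample = FaceSample(
    script="A slow smile spreads into open laughter.",
    coeffs=[[0.0] * NUM_ARKIT_COEFFS for _ in range(5)],
    keyframes=[0, 4],
)
sample.coeffs[4] = [1.0] * NUM_ARKIT_COEFFS  # fully activated expression at the last keyframe
print(round(interpolate_keyframes(sample, 2)[0], 2))  # halfway blend → 0.5
```

In practice a learned model would replace the linear blend, but the record layout shows why keyframe-level supervision is a natural fit for the interpretable ARKit parameterization.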

URL

https://arxiv.org/abs/2512.11321

PDF

https://arxiv.org/pdf/2512.11321.pdf

