Paper Reading AI Learner

Human-Centric Foundation Models: Perception, Generation and Agentic Modeling

2025-02-12 16:38:40
Shixiang Tang, Yizhou Wang, Lu Chen, Yuan Wang, Sida Peng, Dan Xu, Wanli Ouyang

Abstract

Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs) inspired by the success of generalist models, such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding. (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content. (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis. (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques, discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiments modeling.

Abstract (translated)

人类的理解和生成对于数字人及仿人模型的构建至关重要。最近,受大型语言和视觉模型等通用模型成功的启发,以人类为中心的基础模型(HcFMs)兴起并致力于将各种以人为中心的任务整合到一个统一框架中,从而超越了传统的特定任务方法。在这篇综述中,我们提出了一种分类法,通过将其当前的方法分为四个类别来全面概述HcFMs:(1) 以人类为中心的感知基础模型,捕捉多模态2D和3D理解中的细微特征;(2) 以人为中心的人工智能生成(AIGC)基础模型,能够生成高保真度、多样化的人类相关内容;(3) 统一感知与生成模型,整合这些能力以增强人类理解和合成;以及 (4) 以人类为中心的代理基础模型,超越感知和生成,学习类似人的智慧及用于仿人任务中的交互行为。我们回顾了最新的技术,并讨论了新兴挑战和未来的研究方向。该综述旨在为致力于更稳健、多样化且智能的数字人和仿生体建模的研究人员和实践者提供路线图。

URL

https://arxiv.org/abs/2502.08556

PDF

https://arxiv.org/pdf/2502.08556.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot