Paper Reading AI Learner

MIDGET: Music Conditioned 3D Dance Generation

2024-04-18 10:20:37
Jinwu Wang, Wei Mao, Miaomiao Liu

Abstract

In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET based on Dance motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and Motion Generative Pre-Training (GPT) model to generate vibrant and highquality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) employing Motion GPT model to generate pose codes with music and motion Encoders, 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.

Abstract (translated)

在本文中,我们提出了一种名为MIDGET的3D舞蹈条件生成模型,基于舞蹈运动向量量化变分自编码器(VQ-VAE)模型和运动生成预训练(GPT)模型,以生成与音乐节奏相符的鲜艳和高质量的舞蹈。为了解决该领域内的挑战,我们引入了三个新组件:1)基于Motion VQ-VAE模型的预训练记忆编码书,用于存储不同人类姿势代码;2)使用运动生成预训练(GPT)模型生成音乐和运动编码器的姿势编码;3)一个简单的音乐特征提取框架。我们与现有的最先进模型进行了比较,并在AIST++这个最大的公开可用音乐舞蹈数据集上进行了消融实验。实验结果表明,我们提出的框架在动量质量和与音乐的对齐方面均取得了最先进的性能。

URL

https://arxiv.org/abs/2404.12062

PDF

https://arxiv.org/pdf/2404.12062.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot