
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

2024-03-01 17:23:30
Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

Abstract

Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data, especially for online acquisition, has led to a surge in human motion research. Prior works have mainly concentrated on dual-modality learning, such as text-and-motion tasks, while three-modality learning has rarely been explored. Intuitively, an additional modality can enrich a model's application scenarios, and more importantly, a well-chosen extra modality can act as an intermediary that strengthens the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning that integrates human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially designed attention mechanism to foster enhanced alignment and synergistic effects among the text, video, and motion modalities. Empirically, our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance on various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion, and motion-to-video.
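The abstract describes learning a joint embedding space in which text, video, and motion are aligned so that retrieval reduces to nearest-neighbor search across modalities. As a rough illustration only, the sketch below shows what pairwise contrastive (InfoNCE) alignment across three modalities and similarity-based retrieval could look like in PyTorch. This is an assumption about the general technique, not LAVIMO's actual loss or architecture; the paper's attention mechanism and encoders are not reproduced here, and all function names are hypothetical.

```python
# Minimal sketch of tri-modal joint-embedding alignment (hypothetical,
# not LAVIMO's actual implementation).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_loss(text_emb, video_emb, motion_emb):
    """Align all three modalities pairwise; the video terms let video
    act as a bridge between the text and motion modalities."""
    return (info_nce(text_emb, motion_emb) +
            info_nce(text_emb, video_emb) +
            info_nce(video_emb, motion_emb))

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery items (e.g., motions) by cosine similarity to a
    query (e.g., a text embedding) in the shared space."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_embs, dim=-1).t()
    return sims.topk(top_k, dim=-1).indices
```

Under this setup, all four retrieval directions reported in the paper (text-to-motion, motion-to-text, video-to-motion, motion-to-video) amount to calling the same similarity ranking with query and gallery embeddings swapped.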

URL

https://arxiv.org/abs/2403.00691

PDF

https://arxiv.org/pdf/2403.00691.pdf

