Paper Reading AI Learner

SATO: Stable Text-to-Motion Framework

2024-05-02 16:50:41
Wenshuo Chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen

Abstract

Are text-to-motion models robust? Recent advances in text-to-motion models have come primarily from more accurate prediction of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research uncovers a significant issue with text-to-motion models: their predictions are often inconsistent, producing vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability and establish a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. We then introduce a formal framework to address this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and balancing the accuracy-robustness trade-off. We present a methodology for constructing a SATO that satisfies both attention stability and prediction stability. To verify the model's stability, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable under synonyms and other slight perturbations while maintaining high accuracy.
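The abstract's evaluation idea, probing a model with synonym-perturbed versions of the same prompt, can be sketched as follows. This is an illustrative toy, not the paper's dataset-construction code: the synonym map and the example prompt are invented for demonstration.

```python
# Toy sketch: generate synonym-perturbed variants of a motion-description
# prompt, as one might do to probe a text-to-motion model's stability.
# The SYNONYMS map below is a hypothetical example, not from the paper.
SYNONYMS = {
    "walks": ["strolls", "paces"],
    "quickly": ["rapidly", "swiftly"],
    "jumps": ["leaps", "hops"],
}

def perturb(prompt: str) -> list[str]:
    """Return variants of `prompt` with one word swapped for a synonym at a time."""
    words = prompt.split()
    variants = []
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            variants.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return variants

variants = perturb("a person walks quickly forward")
# Each variant is semantically close to the original; a stable model
# should produce near-identical motions for all of them.
```

A stability check would then feed the original and each variant to the model and compare the generated motions (e.g., by pose distance), flagging prompts whose variants yield divergent outputs.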


URL

https://arxiv.org/abs/2405.01461

PDF

https://arxiv.org/pdf/2405.01461.pdf

