Abstract
Is the text-to-motion model robust? Recent advances in text-to-motion models have stemmed primarily from more accurate prediction of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research uncovers a significant issue with text-to-motion models: their predictions are often inconsistent, producing vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. We then introduce a formal framework to address this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and balancing the trade-off between accuracy and robustness. We present a methodology for constructing a SATO that satisfies both attention and prediction stability. To verify the model's stability, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more robust to synonyms and other slight perturbations while maintaining high accuracy.
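The stability evaluation the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy synonym table, the `perturb` helper, and the pairwise-distance stability score are all assumptions for demonstration; the actual benchmark builds perturbations over HumanML3D and KIT-ML, and a real text-to-motion model would return pose sequences rather than a flat vector.

```python
import itertools
import math
import random

# Toy synonym table -- illustrative only; the real perturbation dataset
# uses a much richer synonym set derived from HumanML3D / KIT-ML prompts.
SYNONYMS = {
    "person": ["man", "individual"],
    "walks": ["strolls", "paces"],
    "quickly": ["rapidly", "swiftly"],
}

def perturb(text, rng):
    """Replace each word that has known synonyms with a random synonym."""
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
        for w in text.split()
    )

def stability(model, text, n=5, seed=0):
    """Mean pairwise L2 distance between model outputs across n synonym
    perturbations of `text`. Lower values indicate a more stable model."""
    rng = random.Random(seed)
    outputs = [model(perturb(text, rng)) for _ in range(n)]
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    pairs = list(itertools.combinations(outputs, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```

A perfectly stable model (identical output for every paraphrase) scores 0 under this metric, while an unstable one scores higher; the paper's point is that CLIP-based text encoders drift under such perturbations unless stability is enforced.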
URL
https://arxiv.org/abs/2405.01461