
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction

2024-04-22 17:55:07
Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang

Abstract

In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning, which involve only superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is multi-task learning, in which narrative language and evaluative information are predicted separately. However, this approach degrades performance on the individual tasks because of the variation between tasks and the difference in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework uses a pair of transformers to facilitate interaction between the different modalities of information, and it uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narrations, and we establish benchmarks for NAE. Extensive experimental results show that our method outperforms both separate-learning methods and naive multi-task learning methods. Data and code are released at this https URL.
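
To make the prompt-based reformulation concrete, below is a minimal sketch (not the authors' released code) of how score regression can be recast as video-text matching: the score range is discretized into intervals, each interval gets a text prompt, and the video is scored by how strongly its embedding matches each prompt. The encoder stand-ins, prompt wording, and bin settings are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch: score regression recast as video-text matching over score-interval
# prompts. All module shapes and prompt text here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BINS, SCORE_MAX, EMBED_DIM = 10, 100.0, 256
bin_edges = torch.linspace(0.0, SCORE_MAX, NUM_BINS + 1)     # interval boundaries
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])         # one center per bin

# Hypothetical prompt per score interval; a real system would encode these
# with a pretrained language model rather than a learned embedding table.
prompts = [f"a performance scoring between {lo:.0f} and {hi:.0f} points"
           for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

class VideoTextMatcher(nn.Module):
    """Toy stand-ins for the video encoder and prompt encoder."""
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(512, EMBED_DIM)              # assumed video feature dim
        self.text_embed = nn.Embedding(len(prompts), EMBED_DIM)  # one vector per prompt

    def forward(self, video_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)     # (B, D)
        t = F.normalize(self.text_embed.weight, dim=-1)          # (NUM_BINS, D)
        return v @ t.T                                           # matching logits (B, NUM_BINS)

model = VideoTextMatcher()
video_feat = torch.randn(4, 512)                 # dummy clip-level features
logits = model(video_feat)
probs = logits.softmax(dim=-1)
pred_score = probs @ bin_centers                 # expected score from the matching distribution
target = torch.tensor([71.0, 85.5, 60.0, 93.0])  # ground-truth judge scores
target_bin = torch.bucketize(target, bin_edges[1:-1])  # interval each score falls in
loss = F.cross_entropy(logits, target_bin)       # train matching like classification
```

Training the matching logits against the ground-truth score bin and reading out the expected score is one common way such a reformulation is realized; the paper's full framework additionally couples this matching signal with narrative generation through its paired transformers.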


URL

https://arxiv.org/abs/2404.14471

PDF

https://arxiv.org/pdf/2404.14471.pdf

