Paper Reading AI Learner

LALM: Long-Term Action Anticipation with Language Models

2023-11-29 02:17:27
Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, Xi Wang

Abstract

Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods heavily rely on representation learning trained on extensive video data, there exists a significant limitation: obtaining effective video representations proves challenging due to the inherent complexity and variability in human activities. Furthermore, exclusive dependence on video-based learning may constrain a model's capability to generalize across long-tail classes and out-of-distribution scenarios. In this study, we introduce a novel approach for long-term action anticipation using language models (LALM), adept at addressing the complex challenges of long-term activity understanding without the need for extensive training. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we implement Maximal Marginal Relevance for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that LALM surpasses state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate LALM on two additional benchmarks, affirming its capacity to generalize across intricate activities with different taxonomies. These results are achieved without task-specific fine-tuning.
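
The abstract names two concrete ingredients: a prompt assembled from recognized past actions plus a scene description, and Maximal Marginal Relevance (MMR) for choosing in-context examples. The sketch below is only a rough illustration of that idea, not the authors' implementation: the function names, prompt wording, the λ trade-off weight, the embedding source, and the 20-action horizon are assumptions for the example.

```python
import numpy as np

def mmr_select(query_emb, example_embs, k=4, lam=0.7):
    """Pick k in-context examples by Maximal Marginal Relevance:
    balance similarity to the query against redundancy among picks.
    (Generic MMR; hyperparameters here are illustrative.)"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    sim_to_query = [cos(e, query_emb) for e in example_embs]
    selected, remaining = [], list(range(len(example_embs)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: highest similarity to an already selected example.
            redundancy = max(
                (cos(example_embs[i], example_embs[j]) for j in selected),
                default=0.0,
            )
            score = lam * sim_to_query[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


def build_prompt(past_actions, scene_description, examples, horizon=20):
    """Compose an anticipation prompt from recognized past actions, a
    scene description, and MMR-selected examples (hypothetical template)."""
    demos = "\n\n".join(
        f"Observed actions: {', '.join(ex['past'])}\n"
        f"Future actions: {', '.join(ex['future'])}"
        for ex in examples
    )
    return (
        f"{demos}\n\n"
        f"Scene: {scene_description}\n"
        f"Observed actions: {', '.join(past_actions)}\n"
        f"Predict the next {horizon} actions as 'verb noun' pairs:"
    )
```

The MMR step is what keeps the retrieved demonstrations both relevant to the current activity and mutually diverse; with λ = 1 it degenerates to plain nearest-neighbor selection.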

Abstract (translated)

Understanding human activity is a crucial yet intricate task in egocentric vision, a field focused on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods rely heavily on representation learning trained on extensive video data, a significant limitation remains: obtaining effective video representations is challenging due to the inherent complexity and variability of human activities. Moreover, exclusive reliance on video-based learning can limit a model's ability to generalize to long-tail classes and out-of-distribution scenarios. In this study, we introduce a novel approach to long-term action anticipation with language models (LALM) that addresses the complex challenges of long-term activity understanding without extensive training. Our method combines an action recognition model that tracks previous action sequences with a vision-language model that describes relevant environmental details. Leveraging the context provided by these past events, we devise a prompting strategy for action anticipation with large language models (LLMs). We further employ Maximal Marginal Relevance for example selection to facilitate in-context learning of the LLMs. Experimental results show that LALM surpasses state-of-the-art methods on the long-term action anticipation task of the Ego4D benchmark. We additionally validate LALM on two further benchmarks, confirming its ability to generalize across intricate activities with different taxonomies. All of this is achieved without task-specific fine-tuning.

URL

https://arxiv.org/abs/2311.17944

PDF

https://arxiv.org/pdf/2311.17944.pdf

