Abstract
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that focuses on capturing visual perspectives from the camera wearer's viewpoint. While traditional methods heavily rely on representation learning trained on extensive video data, there exists a significant limitation: obtaining effective video representations proves challenging due to the inherent complexity and variability in human activities. Furthermore, exclusive dependence on video-based learning may constrain a model's capability to generalize across long-tail classes and out-of-distribution scenarios. In this study, we introduce a novel approach for long-term action anticipation using language models (LALM), adept at addressing the complex challenges of long-term activity understanding without the need for extensive training. Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details. By leveraging the context provided by these past events, we devise a prompting strategy for action anticipation using large language models (LLMs). Moreover, we use Maximal Marginal Relevance (MMR) for example selection to facilitate in-context learning of the LLMs. Our experimental results demonstrate that LALM surpasses state-of-the-art methods in the task of long-term action anticipation on the Ego4D benchmark. We further validate LALM on two additional benchmarks, affirming its capacity for generalization across intricate activities with different sets of taxonomies. These results are achieved without task-specific fine-tuning.
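The example-selection step mentioned above can be illustrated with a small sketch of Maximal Marginal Relevance over text embeddings. This is a hypothetical illustration, not the paper's implementation: the cosine similarity, trade-off weight lam, budget k, and the embed helper in the usage comment are all assumptions for the sake of the example.

# Minimal MMR sketch for picking in-context examples: balance relevance to the
# query against redundancy with examples already chosen (standard MMR criterion).
import numpy as np

def mmr_select(query_emb, candidate_embs, k=8, lam=0.5):
    """Return indices of k candidates that are relevant to the query yet mutually diverse."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    relevance = [cos(query_emb, c) for c in candidate_embs]
    selected, remaining = [], list(range(len(candidate_embs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates that are similar to already-selected examples.
            redundancy = max((cos(candidate_embs[i], candidate_embs[j])
                              for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical usage: the query text concatenates the recognized past actions with the
# scene description; candidates are annotated example episodes for the LLM prompt.
# chosen = mmr_select(embed(query_text), [embed(t) for t in example_texts])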
URL
https://arxiv.org/abs/2311.17944