Abstract
Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that rely on aligning visual encoders with large Vision-Language Models (VLMs) to tap into the extensive knowledge of VLMs require large, computationally expensive models and are difficult to train. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, because unseen-class labels are absent from the training data. To address these challenges, we introduce a novel prompt-learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain labels for seen classes alone, fine-tuning VLMs on such datasets tends to optimize the learnable prompts for seen classes rather than unseen ones. We therefore design prompt learning for unseen classes that draws on information from related seen classes, using LLMs to highlight the differences between an unseen class and its related seen classes. Quantitative evaluations on benchmark datasets demonstrate that EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters of existing methods. Code is available at this https URL.
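The abstract describes constructing prompts for unseen HOI classes from related seen classes, guided by LLM-generated descriptions of their differences. Below is a minimal, hedged sketch of that idea only; the class names, dimensions, module structure, and the use of a precomputed "difference" embedding are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): learnable prompts for seen HOI
# classes, with an unseen-class prompt composed from a related seen-class
# prompt plus an LLM-derived "difference" embedding. Dummy tensors stand in
# for encoded LLM/VLM text features; all sizes are assumptions.
import torch
import torch.nn as nn


class GuidedPromptLearner(nn.Module):
    def __init__(self, num_seen, prompt_len=8, dim=512):
        super().__init__()
        # One set of learnable context tokens per seen HOI class.
        self.seen_prompts = nn.Parameter(
            torch.randn(num_seen, prompt_len, dim) * 0.02
        )

    def unseen_prompt(self, related_idx, diff_embed, alpha=0.5):
        # Start from the related seen class's learned prompt and shift it
        # toward the (assumed, precomputed) LLM-described difference between
        # the unseen class and that seen class.
        base = self.seen_prompts[related_idx]           # (prompt_len, dim)
        return base + alpha * diff_embed.unsqueeze(0)   # broadcast over tokens


# Usage with placeholder tensors in place of real LLM/VLM encodings.
learner = GuidedPromptLearner(num_seen=100)
llm_difference = torch.randn(512)   # placeholder for an encoded LLM description
prompt = learner.unseen_prompt(related_idx=3, diff_embed=llm_difference)
print(prompt.shape)                 # torch.Size([8, 512])
```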
URL
https://arxiv.org/abs/2410.23904