Paper Reading AI Learner

EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection

2024-10-31 13:06:29
Qinqian Lei, Bo Wang, Robby T. Tan

Abstract

Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that rely on aligning visual encoders with large Vision-Language Models (VLMs) to tap into their extensive knowledge require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen-class labels. To address these challenges, we introduce a novel prompt-learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain seen-class labels alone, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes rather than unseen ones. We therefore design prompt learning for unseen classes using information from related seen classes, with LLMs used to highlight the differences between unseen classes and their related seen classes. Quantitative evaluations on benchmark datasets demonstrate that EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters of existing methods. Code is available at this https URL.
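For readers unfamiliar with the prompt-learning idea the abstract builds on, the sketch below illustrates the general CoOp-style mechanism: a small set of learnable context vectors is prepended to frozen class-name token embeddings, the result is passed through the (frozen) text encoder, and classes are scored by cosine similarity with an image feature. This is a minimal NumPy illustration under assumed dimensions, not the EZ-HOI implementation; the toy mean-pool "encoder" and all sizes here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper).
EMBED_DIM = 8    # token embedding dimension
N_CTX = 4        # number of learnable context (prompt) tokens
N_CLASSES = 3    # e.g. HOI classes like "ride bicycle", "hold cup", ...

# Learnable context vectors shared across classes. During training these
# would be updated by gradient descent; here they are just random.
context = rng.normal(size=(N_CTX, EMBED_DIM))

# Frozen class-name token embeddings (stand-ins for a VLM's vocabulary).
class_tokens = rng.normal(size=(N_CLASSES, 1, EMBED_DIM))

def build_prompts(context, class_tokens):
    """Prepend the shared learnable context to each class's name tokens."""
    n_classes = class_tokens.shape[0]
    ctx = np.broadcast_to(context, (n_classes, *context.shape))
    return np.concatenate([ctx, class_tokens], axis=1)  # (classes, N_CTX+1, dim)

def encode(prompt_tokens):
    """Toy 'text encoder': mean-pool the tokens and L2-normalize."""
    pooled = prompt_tokens.mean(axis=1)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

prompts = build_prompts(context, class_tokens)
text_features = encode(prompts)                      # (classes, dim)

# Zero-shot scoring: cosine similarity between an image feature and each class.
image_feature = rng.normal(size=EMBED_DIM)
image_feature /= np.linalg.norm(image_feature)
scores = text_features @ image_feature               # (classes,)
print(prompts.shape, scores.shape)
```

Only `context` would carry gradients in a real setup, which is why prompt learning trains a small fraction of the parameters compared to aligning a full visual encoder, the efficiency the abstract's 10.35%-33.95% figure refers to.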


URL

https://arxiv.org/abs/2410.23904

PDF

https://arxiv.org/pdf/2410.23904.pdf

