
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

2024-11-04 17:50:36
Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, Jiankun Yang

Abstract

The past year has witnessed significant advances in video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: CLIP-based visual-prompt alignment, which extracts visual information relevant to the user's instruction; prompt-guided pooling, which compresses the visual sequence to arbitrary scales using convolution-style pooling; and a CLIP context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and a visual context of only 1024 tokens, PPLLaVA achieves better results on image benchmarks as a video LLM, while reaching state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at this https URL.
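The abstract describes the mechanism only at a high level. As a rough illustration of what instruction-aware pooling of this kind can look like, below is a minimal PyTorch-style sketch written under our own assumptions; the function name prompt_guided_pool, the softmax temperature, and the 4x4x4 output grid are illustrative choices, not details taken from the paper. The idea sketched here is to score each CLIP patch token against the CLIP embedding of the user prompt, then use those scores as weights in a 3D average pooling that compresses the video to a fixed token budget.

import torch
import torch.nn.functional as F

def prompt_guided_pool(visual_tokens, text_embed, out_size=(4, 4, 4), tau=0.05):
    # Hypothetical sketch, not the authors' implementation.
    # visual_tokens: (T, H, W, D) CLIP patch features for T sampled frames.
    # text_embed:    (D,) CLIP text embedding of the user instruction.
    T, H, W, D = visual_tokens.shape

    # 1) Visual-prompt alignment: cosine similarity between each patch token
    #    and the instruction embedding, turned into positive relevance weights.
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embed, dim=0)
    scores = (v @ t) / tau                                     # (T, H, W)
    weights = torch.softmax(scores.flatten(), dim=0).reshape(1, 1, T, H, W)

    # 2) Prompt-guided, convolution-style pooling: average-pool the weighted
    #    features and the weights over 3D windows, then divide, so each output
    #    token is a relevance-weighted average of its spatio-temporal window.
    feats = visual_tokens * weights.reshape(T, H, W, 1)
    feats = feats.permute(3, 0, 1, 2).unsqueeze(0)             # (1, D, T, H, W)
    pooled = F.adaptive_avg_pool3d(feats, out_size)
    norm = F.adaptive_avg_pool3d(weights, out_size).clamp(min=1e-6)
    pooled = pooled / norm                                     # (1, D, t, h, w)

    # Flatten to a short visual token sequence for the LLM.
    return pooled.flatten(2).transpose(1, 2).squeeze(0)        # (t*h*w, D)

Whatever the frame count, this kind of pooling emits a fixed number of output tokens, which is consistent with the abstract's claim that the visual sequence can be compressed to arbitrary scales and fit within a small visual context.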

URL

https://arxiv.org/abs/2411.02327

PDF

https://arxiv.org/pdf/2411.02327.pdf
