Paper Reading AI Learner

Language as the Medium: Multimodal Video Classification through text only

2023-09-19 17:32:21
Laura Hanu, Anita L. Verő, James Thewlis

Abstract

Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs can use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show that these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
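
The pipeline described in the abstract can be illustrated with a short sketch: per-frame BLIP-2 captions and a Whisper transcript are flattened into a single prompt, and an LLM is asked to pick the best-matching action label zero-shot, in-context. The snippet below is a minimal sketch of that idea, assuming the OpenAI Python client for GPT-3.5 and hard-coded example captions, transcript and label list; the prompt wording and examples are illustrative, not the authors' exact setup.

    # Minimal sketch: text-only zero-shot video classification via an LLM.
    # Assumes BLIP-2 frame captions and a Whisper transcript were extracted upstream;
    # the prompt and labels below are placeholders, not the paper's exact prompts.
    from openai import OpenAI

    client = OpenAI()  # requires OPENAI_API_KEY in the environment

    frame_captions = [
        "a man holding a guitar on a stage",
        "a crowd of people clapping",
    ]
    audio_transcript = "thank you everyone, this next song is one of my favourites"
    candidate_labels = ["playing guitar", "juggling balls", "mowing the lawn"]

    prompt = (
        "You are given textual descriptions of a video.\n"
        f"Frame captions: {'; '.join(frame_captions)}\n"
        f"Audio transcript: {audio_transcript}\n"
        f"Choose the single best action label from: {', '.join(candidate_labels)}.\n"
        "Answer with the label only."
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(response.choices[0].message.content.strip())

In practice the same prompt template would be filled per video, and the predicted label compared against the ground-truth class to compute accuracy on benchmarks such as UCF-101 or Kinetics.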

URL

https://arxiv.org/abs/2309.10783

PDF

https://arxiv.org/pdf/2309.10783.pdf
