Paper Reading AI Learner

MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting

2024-04-11 07:11:47
Avinash Anand, Janak Kapuriya, Apoorv Singh, Jay Saraf, Naman Lal, Astha Verma, Rushali Gupta, Rajiv Shah

Abstract

While Large Language Models (LLMs) can achieve human-level performance on various tasks, they continue to struggle with multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems. By evaluating publicly available contemporary LLMs on these problems, both with and without their multimodal elements, we aim to shed light on their capabilities. To generate answers for questions with multimodal input (in this case, images and text), we employed zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), with the LLaVA models fine-tuned on our dataset. To evaluate LLMs on purely textual input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcased the performance of the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique, which, when used to train LLaVA-1.5 13b, yielded the best results on our dataset, with superior scores on most metrics and the highest test-set accuracy of 71.65%.
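
As an illustration of the zero-shot multimodal baseline mentioned in the abstract, the sketch below shows one way a physics question with an accompanying figure could be sent to GPT-4. It assumes the OpenAI Python SDK (openai>=1.0), a vision-capable model name, and a hypothetical question and diagram file; it is not the authors' exact MM-PhyQA pipeline, and the MI-CoT prompt template from the paper is not reproduced here.

    # Hedged sketch: zero-shot multimodal physics QA with GPT-4.
    # The prompt wording, model name, and file name below are illustrative
    # assumptions, not the authors' MM-PhyQA / MI-CoT implementation.
    import base64
    from openai import OpenAI  # assumes the openai>=1.0 Python SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def encode_image(path: str) -> str:
        """Return a base64 data URL for a figure attached to the question."""
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    question = (
        "A block of mass 2 kg rests on a 30 degree incline as shown in the figure. "
        "Taking g = 9.8 m/s^2, find the normal force on the block. "
        "Think step by step before giving the final answer."  # chain-of-thought cue
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model; the paper used GPT-4
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # A multi-image prompt would append one entry per figure;
                # here a single hypothetical diagram stands in for them.
                {"type": "image_url",
                 "image_url": {"url": encode_image("incline_diagram.png")}},
            ],
        }],
    )
    print(response.choices[0].message.content)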

URL

https://arxiv.org/abs/2404.08704

PDF

https://arxiv.org/pdf/2404.08704.pdf
