Abstract
While Large Language Models (LLMs) can achieve human-level performance on various tasks, they continue to struggle with multi-step physics reasoning. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, comprising well-constructed, high school-level multimodal physics problems. By evaluating contemporary, publicly available LLMs on these problems, both with and without their multimodal elements, we aim to shed light on their capabilities. For questions with multimodal input (here, images and text), we generated answers via zero-shot prediction using GPT-4 and via LLaVA and LLaVA-1.5, the latter two fine-tuned on our dataset. To evaluate LLMs on purely textual input, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcase the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique, which, when used to train LLaVA-1.5 13b, yielded the best results on our dataset, with superior scores on most metrics and the highest test-set accuracy of 71.65%.
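As a rough illustration of what a multi-image chain-of-thought style prompt might look like, the sketch below assembles a question, several image references, and a step-by-step reasoning cue into a single prompt string. The paper's exact MI-CoT template is not reproduced here, so the placeholder format (`<image>…</image>` tags and the reasoning instruction) is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch of a multi-image chain-of-thought (MI-CoT) style prompt.
# The tag format and wording below are assumptions for illustration only,
# not the template used in the MM-PhyQA paper.

def build_mi_cot_prompt(question: str, image_paths: list[str]) -> str:
    """Combine a physics question, references to multiple images,
    and a chain-of-thought cue into one text prompt."""
    image_tags = "\n".join(f"<image>{p}</image>" for p in image_paths)
    return (
        f"{image_tags}\n"
        f"Question: {question}\n"
        "Let's reason step by step before giving the final answer."
    )

prompt = build_mi_cot_prompt(
    "A block slides down a frictionless incline. Find its acceleration.",
    ["incline_diagram.png", "free_body_diagram.png"],
)
print(prompt)
```

In a real pipeline, the image tags would be replaced by the actual image tensors or API attachments expected by the multimodal model; only the textual scaffolding is shown here.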
URL
https://arxiv.org/abs/2404.08704