Abstract
Recent advances in Large Language Models (LLMs) have shown significant potential in tasks such as text summarization and generation. Yet they often struggle with complex physics problems that require both arithmetic calculation and a sound understanding of concepts. Moreover, many physics problems include images that contain details essential to understanding the problem's context. We propose a Large Multimodal Model (LMM)-based chatbot to answer multimodal physics multiple-choice questions (MCQs). For domain adaptation, we utilize the MM-PhyQA dataset, which comprises Indian high-school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques: Reinforcement Learning from Human Feedback (RLHF) and image captioning. In image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image-processing errors. Inspired by the ranking approach used in RLHF, we further integrate human feedback into the models' learning process to enhance their human-like problem-solving abilities. Compared with vanilla supervised fine-tuned models, the RLHF approach improves the model's problem-solving skills, truthfulness, and reasoning capabilities while reducing hallucinations in the answers and improving their overall quality. We employ the open-source LLaVA model to answer multimodal physics MCQs and compare its performance with and without RLHF.
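As a rough illustration of the ranking idea the abstract refers to, the reward model in a typical RLHF pipeline is trained on human preference pairs with a Bradley-Terry-style pairwise loss. The sketch below is a minimal, self-contained version of that loss; the function names and scalar scores are illustrative, and the paper's exact training objective is not specified in the abstract.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function used by the Bradley-Terry preference model."""
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """-log P(chosen answer is preferred over the rejected answer).

    A reward model trained with this loss learns to assign a higher
    score to the response that human annotators ranked above the other.
    """
    return -math.log(sigmoid(score_chosen - score_rejected))

# The loss shrinks as the margin between chosen and rejected grows,
# and grows when the model scores the rejected answer higher.
assert pairwise_ranking_loss(2.0, 0.0) < pairwise_ranking_loss(1.0, 0.0)
assert pairwise_ranking_loss(0.0, 2.0) > pairwise_ranking_loss(2.0, 0.0)
```

In a full pipeline, the trained reward model's scores then steer the policy model (here, the fine-tuned LLaVA) during reinforcement-learning updates.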
URL
https://arxiv.org/abs/2404.12926