MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-Answering

2024-04-19 14:52:57
Avinash Anand, Janak Kapuriya, Chhavi Kirtani, Apoorv Singh, Jay Saraf, Naman Lal, Jatin Kumar, Adarsh Raj Shivam, Astha Verma, Rajiv Ratn Shah, Roger Zimmermann

Abstract

Recent advancements in LLMs have shown significant potential in tasks such as text summarization and generation. Yet they often struggle with complex physics problems that require arithmetic calculation and a good understanding of concepts. Moreover, many physics problems include images that contain details required to understand the problem's context. We propose an LMM-based chatbot to answer multimodal physics MCQs. For domain adaptation, we use the MM-PhyQA dataset, which comprises Indian high school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques: RLHF (Reinforcement Learning from Human Feedback) and image captioning. For image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image processing errors. We further explore integrating a methodology inspired by the ranking approach in RLHF to enhance the human-like problem-solving abilities of the models. The RLHF approach incorporates human feedback into the learning process of LLMs, improving the model's problem-solving skills, truthfulness, and reasoning capabilities, reducing hallucinations in the answers, and improving answer quality compared with vanilla supervised fine-tuned models. We employ the open-source LLaVA model to answer multimodal physics MCQs and compare its performance with and without RLHF.
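
The abstract mentions a ranking approach from RLHF but does not state the objective used. As a rough illustration only, the sketch below shows the pairwise ranking loss that is commonly used to train reward models from ranked answers, -log(sigmoid(r_preferred - r_rejected)). Whether the paper uses this exact loss is an assumption, and the function names and the example reward values are hypothetical.

```python
import math


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_preferred - r_rejected)).

    Assumption: this is the loss commonly used for RLHF reward models;
    the abstract does not say which objective the authors use.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch for stability.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranked_list_loss(rewards_best_to_worst):
    """Average the pairwise loss over every (better, worse) pair.

    `rewards_best_to_worst` holds hypothetical reward-model scores for the
    candidate answers to one question, ordered by human rank (best first).
    """
    n = len(rewards_best_to_worst)
    if n < 2:
        raise ValueError("need at least two ranked answers")
    losses = [
        pairwise_ranking_loss(rewards_best_to_worst[i], rewards_best_to_worst[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Scores that agree with the human ranking give a lower loss
    # than scores that contradict it.
    print(ranked_list_loss([2.0, 1.0, 0.0]))
    print(ranked_list_loss([0.0, 1.0, 2.0]))
```

The loss shrinks as the reward model scores the human-preferred answer above the rejected one, which is what lets ranked human feedback act as a training signal.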

URL

https://arxiv.org/abs/2404.12926

PDF

https://arxiv.org/pdf/2404.12926.pdf

