
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction

2024-10-31 19:48:12
Andre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers, Adji Bousso Dieng

Abstract

Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures collected from 10 publicly available materials data sources, covering 45 distinct properties. LLM4Mat-Bench features three input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total, respectively. We use LLM4Mat-Bench to fine-tune models of different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of chat-style LLMs, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs for materials property prediction.
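
To make the zero-shot evaluation setting concrete, the sketch below shows one way to prompt a chat-style LLM for a single property from a crystal text description and parse its numeric reply. The prompt wording, the build_zero_shot_prompt and parse_numeric_answer helpers, and the parsing rule are illustrative assumptions, not the benchmark's actual evaluation code.

import re
from typing import Optional


def build_zero_shot_prompt(crystal_text: str, property_name: str, unit: str) -> str:
    """Compose a zero-shot prompt asking a chat LLM to predict one property
    from a crystal text description (a composition string or CIF could be
    passed instead, matching the benchmark's other input modalities)."""
    return (
        "You are an expert materials scientist.\n"
        f"Given the crystal described below, predict its {property_name} in {unit}. "
        "Reply with a single number only.\n\n"
        f"Crystal description:\n{crystal_text}\n\nAnswer:"
    )


def parse_numeric_answer(reply: str) -> Optional[float]:
    """Extract the first number from the model's reply; return None when the
    reply contains no parsable number (such cases are typically scored as failures)."""
    match = re.search(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", reply)
    return float(match.group()) if match else None


# Toy usage (the description and reply below are illustrative, not benchmark data).
prompt = build_zero_shot_prompt(
    "NaCl crystallizes in the cubic Fm-3m space group ...", "band gap", "eV"
)
print(prompt)
print(parse_numeric_answer("The band gap is approximately 5.0 eV."))  # 5.0

A few-shot variant would prepend a handful of description-value pairs before the query in the same format; fine-tuned predictors such as LLM-Prop or MatBERT instead regress the property directly from the encoded text.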

URL

https://arxiv.org/abs/2411.00177

PDF

https://arxiv.org/pdf/2411.00177.pdf

