Abstract
Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures collected from 10 publicly available materials data sources, covering 45 distinct properties. LLM4Mat-Bench features three input modalities: crystal composition, CIF files, and crystal text descriptions, with 4.7M, 615.5M, and 3.1B tokens in total, respectively. We use LLM4Mat-Bench to fine-tune models of different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of chat LLMs, including Llama, Gemma, and Mistral. The results highlight the challenges general-purpose LLMs face in materials science and the need for task-specific predictive models and instruction-tuned LLMs for materials property prediction.
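To make the zero-shot evaluation setting concrete, here is a minimal sketch of prompting a chat LLM to predict a property from a crystal text description. The prompt wording, the example description, and the model checkpoint are illustrative assumptions, not the benchmark's actual templates or data; LLM4Mat-Bench's own prompts and splits should be used for faithful reproduction.

```python
# Hypothetical sketch: zero-shot property prediction with a chat LLM,
# in the spirit of LLM4Mat-Bench's evaluation of Llama/Gemma/Mistral.
# Prompt format and example description are illustrative only.
from transformers import pipeline

# Any instruction-tuned chat model could stand in here; this checkpoint
# is an assumption, not the paper's exact configuration.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
)

# A crystal text description of the kind the benchmark uses as input
# (this particular text is a made-up example).
description = (
    "SrTiO3 crystallizes in the cubic Pm-3m space group. Sr2+ is bonded "
    "to twelve equivalent O2- atoms to form SrO12 cuboctahedra."
)

# Zero-shot prompt asking for a single numeric property value.
prompt = (
    "You are a materials scientist. Given the crystal description below, "
    "predict the band gap in eV. Answer with a single number.\n\n"
    f"Description: {description}\n\nBand gap (eV):"
)

# Greedy decoding keeps the numeric answer deterministic.
output = generator(prompt, max_new_tokens=16, do_sample=False)
print(output[0]["generated_text"])
```

The predicted number would then be parsed from the generated text and scored against the ground-truth property value, which is how prompt-based LLMs can be compared against fine-tuned predictors such as LLM-Prop and MatBERT.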
URL
https://arxiv.org/abs/2411.00177