Abstract
Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures collected from 10 publicly available materials data sources, covering 45 distinct properties. LLM4Mat-Bench features three input modalities: crystal composition, CIF files, and crystal text descriptions, with 4.7M, 615.5M, and 3.1B tokens in total, respectively. We use LLM4Mat-Bench to fine-tune models of different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of chat LLMs, including Llama, Gemma, and Mistral. The results highlight the challenges general-purpose LLMs face in materials science and the need for task-specific predictive models and instruction-tuned LLMs for materials property prediction.
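To make the zero-shot evaluation setting concrete, here is a minimal sketch of prompting a chat LLM to predict a property from a crystal text description. The prompt wording, the example description, and the model checkpoint are illustrative assumptions, not the benchmark's actual templates or data; LLM4Mat-Bench's own prompts and splits should be used for faithful reproduction.

```python
# Hypothetical sketch: zero-shot property prediction with a chat LLM,
# in the spirit of LLM4Mat-Bench's evaluation of Llama/Gemma/Mistral.
# Prompt format and example description are illustrative only.
from transformers import pipeline

# Any instruction-tuned chat model could stand in here; this checkpoint
# is an assumption, not the paper's exact configuration.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
)

# A crystal text description of the kind the benchmark uses as input
# (this particular text is a made-up example).
description = (
    "SrTiO3 crystallizes in the cubic Pm-3m space group. Sr2+ is bonded "
    "to twelve equivalent O2- atoms to form SrO12 cuboctahedra."
)

# Zero-shot prompt asking for a single numeric property value.
prompt = (
    "You are a materials scientist. Given the crystal description below, "
    "predict the band gap in eV. Answer with a single number.\n\n"
    f"Description: {description}\n\nBand gap (eV):"
)

# Greedy decoding keeps the numeric answer deterministic.
output = generator(prompt, max_new_tokens=16, do_sample=False)
print(output[0]["generated_text"])
```

The predicted number would then be parsed from the generated text and scored against the ground-truth property value, which is how prompt-based LLMs can be compared against fine-tuned predictors such as LLM-Prop and MatBERT.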
URL
https://arxiv.org/abs/2411.00177