Abstract
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model, using it to simulate people's opinions from various public surveys, and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) on contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
Abstract (translated)
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), assessing how well these models replicate human behavior becomes crucial. In this work, we use hypothesis testing to propose a quantitative framework for assessing the deviation between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine, in a principled way, whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model, using various public surveys to simulate people's opinions, and found that for contentious questions this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and income levels). This raises questions about the alignment of this language model with the tested populations and highlights the need for new practices in social science research with LLMs that go beyond naive simulation of human subjects.
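The abstract does not spell out the test statistic, so the following is only a minimal sketch of the kind of comparison such a framework might perform, assuming it reduces to a two-sample chi-squared test between the LLM-simulated and human answer distributions for a single multiple-choice question; the counts, option labels, and the choice of test are illustrative assumptions, not the paper's actual procedure.

```python
# Sketch (hypothetical data): test whether an LLM's simulated answer distribution
# on one multiple-choice survey question matches the distribution observed in a
# human sub-population, via a chi-squared test of homogeneity.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical answer counts over four options (A, B, C, D) for one question.
human_counts = np.array([120, 340, 90, 50])   # responses from a real survey sub-population
llm_counts = np.array([300, 150, 100, 50])    # responses from repeated LLM simulations

# Null hypothesis: the LLM-simulated and human response distributions are the same.
table = np.vstack([human_counts, llm_counts])
chi2, p_value, dof, _ = chi2_contingency(table)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Reject the null: the simulated answers are misaligned with this sub-population.")
else:
    print("No evidence of misalignment at the 5% level for this question.")
```

In practice, a framework like the one described would run such a comparison per question and per sub-population (e.g., race, age, income), with appropriate multiple-testing control; that extension is not shown here.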
URL
https://arxiv.org/abs/2506.14997