Abstract
Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains underexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM "personas" using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI's text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot prompting, suggesting that personality inference relies more on latent semantic representation than on explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
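As a rough illustration of the third paradigm and the reported metric, the sketch below fits a regressor on fixed embeddings and computes one Pearson correlation per Big Five trait. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not name the regressor, so closed-form ridge regression is assumed here, the embeddings and trait scores are synthetic, and the names `ridge_fit`, `pearson_r`, `per_trait_correlations`, and `TRAITS` are hypothetical.

```python
import numpy as np

# Hypothetical column order for the five continuous trait scores.
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression (an assumed regressor, not from the paper).

    X: (n_samples, n_features) embeddings; Y: (n_samples, n_traits) scores.
    Returns a weight matrix W and an intercept vector b.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(A, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def pearson_r(pred, gold):
    """Pearson correlation between two 1-D sequences (NaN if either is constant)."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    pc, gc = pred - pred.mean(), gold - gold.mean()
    denom = np.sqrt((pc ** 2).sum() * (gc ** 2).sum())
    return float((pc * gc).sum() / denom) if denom > 0 else float("nan")


def per_trait_correlations(pred, gold):
    """Return {trait: Pearson r} for (n_samples, 5) prediction and gold arrays."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    if pred.shape != gold.shape or pred.ndim != 2 or pred.shape[1] != len(TRAITS):
        raise ValueError("expected two arrays of shape (n_samples, 5)")
    return {t: pearson_r(pred[:, i], gold[:, i]) for i, t in enumerate(TRAITS)}


if __name__ == "__main__":
    # Synthetic stand-ins for transcript embeddings and trait scores.
    rng = np.random.default_rng(0)
    X_train, X_test = rng.normal(size=(80, 16)), rng.normal(size=(20, 16))
    W_true = rng.normal(size=(16, len(TRAITS)))
    Y_train = X_train @ W_true + rng.normal(size=(80, len(TRAITS)))
    Y_test = X_test @ W_true + rng.normal(size=(20, len(TRAITS)))

    W, b = ridge_fit(X_train, Y_train, alpha=1.0)
    print(per_trait_correlations(X_test @ W + b, Y_test))
```

Reporting one correlation per trait on held-out data, rather than a single pooled score, matches how the abstract frames its result (every trait-level correlation staying below 0.26).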
URL
https://arxiv.org/abs/2509.13244