Abstract
Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs show a significant performance discrepancy across cultures when tested on culture-specific commonsense knowledge; (2) LLMs' general commonsense capability is affected by cultural context; and (3) the language used to query the LLMs can impact their performance on culture-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally aware language models.
URL
https://arxiv.org/abs/2405.04655