Abstract
Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.
URL
https://arxiv.org/abs/2507.13238