
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

2025-07-17 15:47:49
Ashray Gupta, Rohan Joseph, Sunny Rai

Abstract

Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

URL

https://arxiv.org/abs/2507.13238

PDF

https://arxiv.org/pdf/2507.13238.pdf

