Abstract
As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 Billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench - the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set 29 of Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM and LLaMA on IndicGenBench in a variety of settings. The largest PaLM-2 models performs the best on most tasks, however, there is a significant performance gap in all languages compared to English showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at this http URL
Abstract (translated)
随着大型语言模型(LLMs)在全球范围内的应用不断增加,LLMs代表世界语言多样性至关重要。印度是一个拥有14亿人口的多语言国家。为了促进对多语言LLM评估的研究,我们发布了IndicGenBench - 针对13个脚本和4个语言家庭的多语言用户生成任务评估的最大基准。IndicGenBench由跨语言摘要、机器翻译和跨语言问题回答等多样化的生成任务组成。通过人类审核,IndicGenBench为许多印度语言提供了多途径并行评估数据,为许多代表性不足的印度语言首次提供了全面评估。我们在IndicGenBench上评估了各种专有和开源LLM,包括GPT-3.5、GPT-4、PaLM-2、mT5、Gemma、BLOOM和LLLaMA。在IndicGenBench上,最大的PaLM-2模型在大多数任务上表现最佳,然而,与英语相比,所有语言之间的性能差距都很大,这表明需要进一步研究更包容的多语言语言模型的开发。IndicGenBench发布在以下URL:
URL
https://arxiv.org/abs/2404.16816