Abstract
In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually curated parallel dataset in Basque and Spanish, together with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments, using encoder-only and decoder-based Large Language Models (LLMs), shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models struggle particularly with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
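The error analysis mentioned above rules out lexical overlap as the cause of the performance drop. As a rough illustration of what such a check could look like, the sketch below computes the fraction of hypothesis tokens that also occur in the premise of an NLI pair. The function name `token_overlap`, the whitespace tokenization, and the English example sentences are assumptions made for illustration only; they are not taken from the paper, which may define and measure overlap differently.

```python
def token_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise.

    Hypothetical helper: lowercases both texts and splits on whitespace.
    This is an assumed, simplified overlap measure, not the paper's metric.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        # An empty hypothesis shares nothing with the premise.
        return 0.0
    shared = sum(1 for tok in hypothesis_tokens if tok in premise_tokens)
    return shared / len(hypothesis_tokens)


# Illustrative placeholder pairs (not drawn from the dataset).
pairs = [
    ("the cat sleeps on the sofa", "the cat sleeps"),
    ("the cat sleeps on the sofa", "a dog barks"),
]
overlaps = [token_overlap(p, h) for p, h in pairs]
```

Comparing such overlap scores between correctly and incorrectly classified examples, for the standard and the variant versions of each pair, is one way to test whether errors track overlap or the variation itself.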
URL
https://arxiv.org/abs/2506.15239