Abstract
Despite the extensive number of labeled datasets in NLP text classification, data availability across languages remains persistently imbalanced. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a significant lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing a "recipe" for the optimal setups.
URL
https://arxiv.org/abs/2404.02043