Abstract
Despite the extensive number of labeled datasets in NLP text classification, data availability across languages remains persistently imbalanced. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To the best of our knowledge, there is a significant lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing a "recipe" for the optimal setups.
URL
https://arxiv.org/abs/2404.02043