Abstract
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task for the Georgian language, based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset built by filtering the Georgian Common Crawl corpus. The dataset is used to train a classifier for words with multiple senses. We also present experimental results of using an LSTM for WSD. Accurate disambiguation of homonyms is crucial in natural language processing. Georgian, an agglutinative language of the Kartvelian family, presents unique challenges in this context. The aim of this paper is to highlight the specific problems of homonym disambiguation in Georgian and to present our approach to solving them. The techniques discussed achieve 95% accuracy in predicting the lexical meanings of homonyms on a hand-classified dataset of over 7,500 sentences.
URL
https://arxiv.org/abs/2405.00710