Abstract
The term "code-mixed" refers to the use of more than one language within the same text. The phenomenon is observed predominantly on social media platforms, and its adoption is increasing over time. It is critical to detect foreign elements in a language and process them correctly, because a considerable number of people use code-mixed language that cannot be comprehended by understanding only one of the constituent languages. In this work, we focus on the low-resource Hindi-English code-mixed language and on enhancing performance across several code-mixed natural language processing tasks: sentiment analysis, emotion recognition, and hate speech identification. We perform a comparative analysis of different Transformer-based language models pre-trained using unsupervised approaches. We include code-mixed models such as HingBERT, HingRoBERTa, HingRoBERTa-Mixed, and mBERT, as well as non-code-mixed models such as ALBERT, BERT, and RoBERTa, for comparison on code-mixed Hindi-English downstream tasks. Using HingBERT-based models, which are pre-trained specifically on real code-mixed text, we report state-of-the-art results on the respective datasets. The HingBERT-based models yield significant improvements, highlighting the poor performance of vanilla BERT models on code-mixed text.
URL
https://arxiv.org/abs/2305.15722