Paper Reading AI Learner

Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data

2023-05-25 05:10:28
Aryan Patil, Varad Patwardhan, Abhishek Phaltankar, Gauri Takawane, Raviraj Joshi

Abstract

The term "code-mixed" refers to the use of more than one language in the same text. This phenomenon is predominantly observed on social media platforms, and its adoption has been increasing over time. It is critical to detect foreign elements in a language and process them correctly, as a considerable number of individuals use code-mixed language that cannot be understood by knowing only one of its constituent languages. In this work, we focus on the low-resource Hindi-English code-mixed language and enhance the performance of different code-mixed natural language processing tasks such as sentiment analysis, emotion recognition, and hate speech identification. We perform a comparative analysis of different Transformer-based language models pre-trained using unsupervised approaches. We include the code-mixed models HingBERT, HingRoBERTa, and HingRoBERTa-Mixed, the multilingual model mBERT, and the non-code-mixed models ALBERT, BERT, and RoBERTa in a comparative analysis of code-mixed Hindi-English downstream tasks. We report state-of-the-art results on the respective datasets using HingBERT-based models, which are specifically pre-trained on real code-mixed text. Our HingBERT-based models provide significant improvements, highlighting the poor performance of vanilla BERT models on code-mixed text.
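The comparison described above comes down to evaluating each pre-trained checkpoint on the same code-mixed downstream data. A minimal sketch of the vocabulary-mismatch issue the abstract alludes to, assuming the Hugging Face `transformers` library and the public `l3cube-pune/hing-bert` checkpoint released by the HingBERT authors (the example sentence is an illustrative placeholder, not from the paper's datasets):

```python
from transformers import AutoTokenizer

# An illustrative Romanized Hindi + English (code-mixed) sentence.
text = "yeh movie bahut awesome thi, totally paisa vasool"

# HingBERT is pre-trained on real Hindi-English code-mixed text, while
# vanilla BERT has seen (mostly) English. Comparing how each tokenizer
# splits the same sentence hints at why vanilla BERT underperforms:
# heavier subword fragmentation of the Romanized Hindi tokens.
for name in ["bert-base-uncased", "l3cube-pune/hing-bert"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(text)
    print(f"{name}: {len(pieces)} subwords -> {pieces}")
```

For the actual downstream comparison, each checkpoint would then be fine-tuned with a classification head (e.g. `AutoModelForSequenceClassification`) on the sentiment, emotion, and hate-speech datasets; the tokenization step above is only the cheapest place to see the code-mixed pre-training advantage.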


URL

https://arxiv.org/abs/2305.15722

PDF

https://arxiv.org/pdf/2305.15722.pdf

