Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining

2022-10-10 15:40:43

Asa Cooper Stickland, Sailik Sengupta, Jason Krone, Saab Mansour, He He

arXiv_AI

Abstract

Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications, where noise such as typos or grammatical mistakes is abundant and degrades performance. Unfortunately, works that assess the robustness of neural models on noisy data and suggest improvements are limited to the English language. Upon analyzing noise in different languages, we observe that noise types vary across languages and thus require their own investigation. To benchmark the performance of pretrained multilingual models, we therefore construct noisy datasets covering five languages and four NLP tasks. We see a gap in performance between clean and noisy data. After investigating ways to boost the zero-shot cross-lingual robustness of multilingual pretrained models, we propose Robust Contrastive Pretraining (RCP). RCP combines data augmentation with a contrastive loss term at the pretraining stage and achieves large improvements on noisy (and original) test data across two sentence-level classification (+3.2%) and two sequence-labeling (+10 F1-score) multilingual tasks.
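
The abstract gives no implementation details for RCP, so the following is only a minimal sketch of the general idea of pairing noise-based data augmentation with a contrastive loss. Everything in it is an assumption made for illustration and is not taken from the paper: the character-swap noiser `add_typos`, the InfoNCE-style loss `info_nce`, the temperature value, and the use of plain Python lists as stand-ins for sentence embeddings that a pretrained encoder would produce.

```python
import math
import random


def add_typos(sentence, rate=0.1, seed=0):
    """Hypothetical noiser: swap adjacent letters with probability `rate`."""
    rng = random.Random(seed)
    chars = list(sentence)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(clean, noisy, temperature=0.1):
    """Mean InfoNCE loss over a batch of (clean[i], noisy[i]) positive pairs.

    For each clean embedding, its own noisy counterpart is the positive and
    the other noisy embeddings in the batch act as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(clean):
        logits = [cosine(anchor, other) / temperature for other in noisy]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(clean)


if __name__ == "__main__":
    print(add_typos("robust contrastive pretraining", rate=0.3))
    clean_vecs = [[1.0, 0.0], [0.0, 1.0]]
    noisy_vecs = [[0.9, 0.1], [0.1, 0.9]]
    print(info_nce(clean_vecs, noisy_vecs))
```

In a pretraining setup along these lines, a term like this would be added to the usual pretraining objective so that a sentence and its noised version are pulled toward similar representations. How the paper weights or combines the terms is not stated in the abstract.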

URL

https://arxiv.org/abs/2210.04782

PDF

https://arxiv.org/pdf/2210.04782.pdf