Misspellings in Natural Language Processing: A survey

2025-01-28 10:26:04
Gianluca Sperduti, Alejandro Moreo

Abstract

This survey provides an overview of the challenges that misspellings pose in natural language processing (NLP). While often unintentional, misspellings have become ubiquitous in digital communication, especially with the proliferation of Web 2.0, user-generated content, and informal text media such as social media, blogs, and forums. Although humans can generally interpret misspelled text, NLP models frequently struggle to handle it, which degrades performance in common tasks like text classification and machine translation. In this paper, we reconstruct the history of misspellings as a scientific problem. We then discuss the latest advances in addressing misspellings in NLP. The main strategies for mitigating their effect include data augmentation, double-step, character-order-agnostic, and tuple-based methods, among others. This survey also examines dedicated data challenges and competitions designed to spur progress in the field. Critical safety and ethical concerns are examined as well, for example the deliberate use of misspellings to inject malicious messages and hate speech on social networks. Furthermore, the survey explores psycholinguistic perspectives on how humans process misspellings, which could inform innovative computational techniques for text normalization and representation. Finally, the misspelling-related challenges and opportunities associated with modern large language models are analyzed, including benchmarks, datasets, and the performance of the most prominent language models on misspelled input. This survey aims to be an exhaustive resource for researchers seeking to mitigate the impact of misspellings in the rapidly evolving landscape of NLP.
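
To make two of the strategy names in the abstract more concrete, here is a minimal Python sketch. It is not taken from the survey: the function names (`swap_adjacent`, `augment`, `order_agnostic_key`) are hypothetical, adjacent-character swaps are only one possible kind of synthetic misspelling for data augmentation, and the "first letter, sorted inner letters, last letter" key is one assumed way of building a character-order-agnostic representation.

```python
import random


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent inner characters (assumed noise model)."""
    if len(word) < 4:
        return word  # too short to perturb without touching the first or last letter
    i = rng.randrange(1, len(word) - 2)  # index of the left character of the pair
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(sentence: str, rate: float, seed: int = 0) -> str:
    """Data augmentation sketch: misspell each word with probability `rate`."""
    rng = random.Random(seed)
    return " ".join(
        swap_adjacent(w, rng) if rng.random() < rate else w
        for w in sentence.split()
    )


def order_agnostic_key(word: str) -> tuple[str, str, str]:
    """Character-order-agnostic sketch: first char, sorted inner chars, last char."""
    if len(word) < 3:
        return (word, "", "")
    return (word[0], "".join(sorted(word[1:-1])), word[-1])
```

Under these assumptions, any misspelling produced by `swap_adjacent` maps to the same `order_agnostic_key` as the original word, which illustrates why a representation that ignores inner character order is unaffected by this kind of noise. Noisy copies produced by `augment` could be added to a training set alongside the clean sentences.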

URL

https://arxiv.org/abs/2501.16836

PDF

https://arxiv.org/pdf/2501.16836.pdf

