Paper Reading AI Learner

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

2024-04-27 12:10:10
Manuel Tonneau, Diyi Liu, Samuel Fraiberger, Ralph Schroeder, Scott A. Hale, Paul Röttger

Abstract

Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as a single language may be spoken in multiple countries that are home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We conduct a systematic survey of HS datasets in eight languages, confirming past findings of an English-language bias but also showing that this bias has been steadily decreasing in recent years. For three geographically widespread languages -- English, Arabic and Spanish -- we then leverage geographical metadata from tweets to approximate geo-cultural contexts by pairing language and country information. We find that HS datasets for these languages exhibit a strong geo-cultural bias, largely overrepresenting a handful of countries (e.g., the US and UK for English) relative to their prominence in both the broader social media population and the general population of speakers of these languages. Based on these findings, we formulate recommendations for the creation of future HS datasets.

URL

https://arxiv.org/abs/2404.17874

PDF

https://arxiv.org/pdf/2404.17874.pdf

