K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment

2022-08-23 02:10:53

Jean Lee, Taejun Lim, Heejun Lee, Bogeun Jo, Yangsok Kim, Heegeun Yoon, Soyeon Caren Han

arXiv_AI

Abstract

Online hate speech detection has become important with the growth of digital devices, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments, each annotated with 1 to 4 labels, so that it can handle subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms the other baselines, recognising decomposed characters in each hate speech class.
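
To make the phrase "decomposed characters" concrete, the following is a minimal sketch of how a precomposed Hangul syllable can be split into its component jamo using Unicode normalization (NFD) from the Python standard library. It is not the KR-BERT tokenizer and is not taken from the paper; the function name `decompose_hangul` is hypothetical and serves only to illustrate the kind of sub-character units such a tokenizer could operate on.

```python
import unicodedata


def decompose_hangul(text: str) -> list[str]:
    """Split precomposed Hangul syllables into conjoining jamo.

    Illustrative assumption: Unicode NFD is used here as a stand-in for
    sub-character decomposition; the actual KR-BERT preprocessing may differ.
    """
    return list(unicodedata.normalize("NFD", text))


# One syllable block becomes three jamo: initial, medial, final.
jamo = decompose_hangul("한")
print([f"U+{ord(ch):04X}" for ch in jamo])

# The decomposition is lossless: NFC recomposes the original syllable.
recomposed = unicodedata.normalize("NFC", "".join(jamo))
print(recomposed == "한")
```

Working at this level lets a model share units between syllables that differ by a single consonant or vowel, which may help with the deliberately altered spellings common in online comments.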

URL

https://arxiv.org/abs/2208.10684

PDF

https://arxiv.org/pdf/2208.10684.pdf