Abstract
Automatic hate speech detection with deep neural models is hampered by the scarcity of labeled datasets, leading to poor generalization. To mitigate this problem, generative AI has been used to produce large amounts of synthetic hate speech sequences from available labeled examples, and the generated data is then leveraged in fine-tuning large pre-trained language models (LLMs). In this chapter, we review the relevant methods, experimental setups, and evaluations of this approach. In addition to general LLMs such as BERT, RoBERTa, and ALBERT, we apply and evaluate the impact of training-set augmentation with generated data on LLMs that have already been adapted for hate speech detection, including RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen. An empirical study corroborates our previous findings, showing that this approach improves generalization in hate speech detection, boosting recall across data distributions. In addition, we explore and compare the performance of the fine-tuned LLMs against zero-shot hate speech detection with a GPT-3.5 model. Our results show that although the GPT-3.5 model achieves better generalization, its recall is mediocre and its precision low on most datasets. It remains an open question whether the sensitivity of models such as GPT-3.5 and its successors can be improved using similar text-generation techniques.
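The augmentation-then-fine-tune pipeline described above can be illustrated with a minimal sketch. This assumes the HuggingFace checkpoint GroNLP/hateBERT and uses placeholder seed and synthetic examples; the paper's actual generation model, datasets, and hyperparameters are not given in this abstract.

```python
# Minimal sketch: fine-tune a hate-adapted LLM on a training set augmented
# with synthetic examples. Seed/synthetic strings below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Original labeled examples (label 1 = hate, 0 = not hate).
seed_data = [("example hateful text", 1), ("example benign text", 0)]
# Synthetic sequences produced by a generative model from the seed examples
# (placeholders; any text generator could fill this role).
synthetic_data = [("generated hateful variant", 1),
                  ("generated benign variant", 0)]

# Augmented training set = seed examples + generated examples.
texts, labels = zip(*(seed_data + synthetic_data))
dataset = Dataset.from_dict({"text": list(texts), "label": list(labels)})

model_name = "GroNLP/hateBERT"  # BERT, RoBERTa, etc. can be swapped in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```

The same loop applies to each checkpoint named in the abstract; only `model_name` changes.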
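For the zero-shot comparison, GPT-3.5 is queried directly with no fine-tuning. A hedged sketch, assuming the OpenAI chat completions API and an illustrative prompt (the exact prompt used in the study is not specified here):

```python
# Zero-shot hate speech labeling with GPT-3.5; the prompt wording is an
# illustration only, not the prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def zero_shot_hate_label(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word: 'hate' or 'not-hate'."},
            {"role": "user", "content": f"Classify this text: {text}"},
        ],
        temperature=0,  # make the labeling as deterministic as possible
    )
    return response.choices[0].message.content.strip().lower()

print(zero_shot_hate_label("example text to classify"))
```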
URL
https://arxiv.org/abs/2311.09993