HateMM: A Multi-Modal Dataset for Hate Video Classification

2023-05-06 03:39:00
Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Manish Gupta, Animesh Mukherjee

Abstract

Hate speech has become one of the most significant issues in modern society, with implications in both the online and the offline world. As a result, hate speech research has recently gained a lot of traction. However, most of the work has primarily focused on text, with relatively little work on images and even less on videos. Early-stage automated video moderation techniques are therefore needed to handle uploaded videos and keep platforms safe and healthy. With a view to detecting and removing hateful content from video-sharing platforms, our work focuses on hate video detection using multiple modalities. To this end, we curate ~43 hours of videos from BitChute and manually annotate them as hate or non-hate, along with the frame spans that explain the labelling decision. To collect the relevant videos, we harnessed search keywords from hate lexicons. We observe various cues in the images and audio of hateful videos. Further, we build multi-modal deep learning models to classify hate videos and observe that using all modalities of a video improves overall hate speech detection performance (accuracy = 0.798, macro F1-score = 0.790) by ~5.7% in macro F1 over the best uni-modal model. In summary, our work takes the first step toward understanding and modeling hateful videos on video hosting platforms such as BitChute.
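
The headline result is that fusing the three modalities (transcribed text, audio, and video frames) beats any single modality. The sketch below is a minimal, hypothetical PyTorch illustration of one common way to realize such late fusion; the feature dimensions, layer sizes, and class name are assumptions for illustration, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class LateFusionHateClassifier(nn.Module):
        # Hypothetical late-fusion model: each modality's pre-extracted
        # feature vector is projected to a shared size, concatenated, and
        # classified as hate vs. non-hate. Dimensions are illustrative.
        def __init__(self, text_dim=768, audio_dim=128, video_dim=512, hidden=256):
            super().__init__()
            self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
            self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
            self.video_proj = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
            self.classifier = nn.Linear(3 * hidden, 2)  # logits: non-hate / hate

        def forward(self, text_feat, audio_feat, video_feat):
            fused = torch.cat([self.text_proj(text_feat),
                               self.audio_proj(audio_feat),
                               self.video_proj(video_feat)], dim=-1)
            return self.classifier(fused)

    # Usage with random stand-in features for a batch of 4 videos:
    model = LateFusionHateClassifier()
    logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
    print(logits.shape)  # torch.Size([4, 2])

Dropping any of the three projection branches turns this into the corresponding uni-modal baseline, which is how a macro-F1 gap like the reported ~5.7% between the full model and the best single modality can be measured.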

URL

https://arxiv.org/abs/2305.03915

PDF

https://arxiv.org/pdf/2305.03915.pdf

