Abstract
Memes are a popular form of communicating trends and ideas in social media and on the internet in general, combining the modalities of images and text. They can express humor and sarcasm but can also have offensive content. Analyzing and classifying memes automatically is challenging since their interpretation relies on the understanding of visual elements, language, and background knowledge. Thus, it is important to meaningfully represent these sources and the interaction between them in order to classify a meme as a whole. In this work, we propose to use scene graphs, that express images in terms of objects and their visual relations, and knowledge graphs as structured representations for meme classification with a Transformer-based architecture. We compare our approach with ImgBERT, a multimodal model that uses only learned (instead of structured) representations of the meme, and observe consistent improvements. We further provide a dataset with human graph annotations that we compare to automatically generated graphs and entity linking. Analysis shows that automatic methods link more entities than human annotators and that automatically generated graphs are better suited for hatefulness classification in memes.
URL
https://arxiv.org/abs/2305.18391