Abstract
Imbalanced data is common in the real world, especially in sentiment-related corpora, making it difficult to train a classifier to distinguish the latent sentiment in text data. We observe that humans often express a transition in emotion between two adjacent discourses with discourse markers such as "but", "though", and "while", and that the head discourse and the tail discourse usually indicate opposite emotional tendencies. Based on this observation, we propose a novel plug-and-play method that first samples discourses according to transitional discourse markers and then validates their sentiment polarities with the help of a pretrained attention-based model. Because our method increases sample diversity at the outset, it can serve as an upstream preprocessing step in data augmentation. We conduct experiments on three public sentiment datasets with several frequently used algorithms. Results show that our method is consistently effective, even in highly imbalanced scenarios, and can easily be integrated with oversampling methods to boost performance on imbalanced sentiment classification.
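The two-stage idea described in the abstract (split on a transitional marker, then validate polarities) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the marker list contains only the three examples named in the abstract, the names `split_on_marker` and `sample_discourses` are hypothetical, and `predict_polarity` is an assumed callable standing in for the pretrained attention-based model. The sketch keeps a head/tail pair only when the two predicted polarities differ, which is one plausible reading of the validation step.

```python
import re

# Assumption: only the three markers quoted in the abstract are used here.
MARKERS = ("but", "though", "while")
_MARKER_RE = re.compile(r"\b(" + "|".join(MARKERS) + r")\b", re.IGNORECASE)


def split_on_marker(sentence):
    """Split a sentence at its first transitional discourse marker.

    Returns (head, marker, tail), or None when no marker is found
    or when either side of the marker is empty.
    """
    match = _MARKER_RE.search(sentence)
    if match is None:
        return None
    head = sentence[:match.start()].strip(" ,;")
    tail = sentence[match.end():].strip(" ,;.")
    if not head or not tail:
        return None
    return head, match.group(1).lower(), tail


def sample_discourses(sentence, predict_polarity):
    """Propose labeled discourse samples from one sentence.

    predict_polarity is a hypothetical stand-in for the pretrained
    model: it maps a text to a polarity label. The head and tail are
    kept only if their predicted polarities are opposite.
    """
    parts = split_on_marker(sentence)
    if parts is None:
        return []
    head, _marker, tail = parts
    head_label = predict_polarity(head)
    tail_label = predict_polarity(tail)
    if head_label == tail_label:
        return []  # validation failed: no polarity transition detected
    return [(head, head_label), (tail, tail_label)]


def toy_polarity(text):
    """Toy keyword rule used only to make the example runnable."""
    return "positive" if "great" in text.lower() else "negative"


if __name__ == "__main__":
    review = "The food was great, but the service was slow."
    for text, label in sample_discourses(review, toy_polarity):
        print(label, "|", text)
```

Note that this sketch ignores sentences that begin with a marker (for example "Though ..."), because the head would be empty. A fuller treatment would need clause-level segmentation, which the abstract does not specify.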
URL
https://arxiv.org/abs/1903.11919