Abstract
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation, providing high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL. Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. It then constructs a Dual-Neighbor Graph (DNG) that combines information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. By combining dual diversity with informativeness, DEUCE selects data that is both class-balanced and made up of hard representative instances. Experiments on six NLP datasets demonstrate the superiority and efficiency of DEUCE.
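The pipeline outlined in the abstract can be pictured with a rough sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names, the use of prediction entropy as the uncertainty measure, the k-nearest-neighbor graph construction, and the greedy selection rule are all hypothetical. In particular, simple neighbor averaging stands in for the density-based clustering step, and random arrays stand in for PLM outputs.

```python
import numpy as np


def knn_indices(features, k):
    """Indices of the k nearest rows to each row (Euclidean, self excluded)."""
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def dual_neighbor_graph(embeddings, class_probs, k):
    """Assumed construction: union of a kNN graph over text embeddings and a
    kNN graph over class-probability vectors, symmetrized."""
    n = embeddings.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    for feats in (embeddings, class_probs):
        adj[rows, knn_indices(feats, k).ravel()] = True
    return adj | adj.T


def entropy(class_probs):
    """Prediction entropy, used here as an assumed uncertainty measure."""
    p = np.clip(class_probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)


def propagate_uncertainty(adj, uncertainty):
    """Stand-in for clustering-based propagation: blend each node's
    uncertainty with the mean uncertainty of its graph neighbors."""
    degree = np.maximum(adj.sum(axis=1), 1)
    neighbor_mean = (adj.astype(float) @ uncertainty) / degree
    return 0.5 * uncertainty + 0.5 * neighbor_mean


def select_batch(adj, scores, budget):
    """Greedy selection: take the highest-scoring node, then block its
    neighbors so the batch stays spread out. May return fewer than `budget`."""
    available = np.ones(len(scores), dtype=bool)
    chosen = []
    for idx in np.argsort(-scores):
        if len(chosen) == budget:
            break
        if available[idx]:
            chosen.append(int(idx))
            available[idx] = False
            available[adj[idx]] = False
    return chosen


# Synthetic stand-ins for PLM embeddings and class predictions.
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 8))
logits = rng.normal(size=(60, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

graph = dual_neighbor_graph(emb, probs, k=3)
scores = propagate_uncertainty(graph, entropy(probs))
picked = select_batch(graph, scores, budget=5)
print(picked)
```

Blocking the neighbors of each selected node is one simple way to keep a batch diverse in both the embedding space and the prediction space at once; the actual DEUCE selection rule may differ.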
URL
https://arxiv.org/abs/2502.00305