Abstract
Social scientists quickly adopted large language models because they can annotate documents without supervised training, an ability known as zero-shot learning. However, their compute demands, cost, and often proprietary nature put these models at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero- and few-shot classification, but are also orders of magnitude more efficient and completely open source. When trained on a simple random sample of just 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents, as well as state-of-the-art generative models guided by complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
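As a rough illustration of the entailment-based zero-shot approach the abstract describes, the sketch below uses the Hugging Face transformers zero-shot-classification pipeline. The model ID shown is a generic NLI checkpoint used as a stand-in, and the example document, candidate labels, and hypothesis template are illustrative assumptions rather than details taken from the paper; a released Political DEBATE checkpoint would be substituted in the same slot.

    # pip install transformers torch
    from transformers import pipeline

    # Stand-in NLI checkpoint; swap in the released Political DEBATE model ID.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    document = "The senator introduced a bill to expand rural broadband access."
    labels = ["infrastructure policy", "immigration policy", "foreign policy"]

    # Entailment-based zero-shot classification: each candidate label is slotted
    # into the hypothesis template, and the model's entailment score for
    # "document entails hypothesis" becomes that label's score.
    result = classifier(
        document,
        candidate_labels=labels,
        hypothesis_template="This text is about {}.",
    )
    print(result["labels"][0], round(result["scores"][0], 3))

Because the Political DEBATE models are DeBERTa-based entailment classifiers, they are intended to slot into this kind of NLI-style classification workflow rather than requiring generative prompting.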
URL
https://arxiv.org/abs/2409.02078