NatCat: Weakly Supervised Text Classification with Naturally Annotated Datasets

Abstract

We seek to improve text classification by leveraging naturally annotated data. In particular, we construct a general-purpose text categorization dataset (NatCat) from three online resources: Wikipedia, Reddit, and Stack Exchange. These datasets consist of document-category pairs derived from the manual curation that their communities perform naturally. We build general-purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval). We benchmark different modeling choices and dataset combinations, and show how each task benefits from different NatCat training resources.
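As a rough illustration of what "document-category pairs" could look like as training data, the sketch below turns a few toy pairs into labeled examples for a binary "does this category fit this document?" classifier. This is only a sketch under stated assumptions: the records in `PAIRS`, the function name `make_binary_examples`, and the use of randomly sampled negative categories are all hypothetical and are not taken from the abstract.

```python
import random
from typing import List, Tuple

# Hypothetical toy records. Real NatCat pairs are derived from
# Wikipedia, Reddit, and Stack Exchange; these are made up.
PAIRS: List[Tuple[str, str]] = [
    ("How do I reverse a linked list in C?", "programming"),
    ("Best trails near Denver for beginners", "hiking"),
    ("The French Revolution began in 1789.", "history"),
]


def make_binary_examples(
    pairs: List[Tuple[str, str]], num_negatives: int = 1, seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Turn (document, category) pairs into (document, category, label) triples.

    Each true pair yields one positive example (label 1) and up to
    `num_negatives` negatives (label 0) whose categories are sampled
    at random from the other categories seen in `pairs`.
    Negative sampling is an assumption made for this illustration.
    """
    rng = random.Random(seed)
    categories = sorted({category for _, category in pairs})
    examples: List[Tuple[str, str, int]] = []
    for document, category in pairs:
        examples.append((document, category, 1))
        wrong = [c for c in categories if c != category]
        for negative in rng.sample(wrong, min(num_negatives, len(wrong))):
            examples.append((document, negative, 0))
    return examples


if __name__ == "__main__":
    for document, category, label in make_binary_examples(PAIRS):
        print(label, category, "|", document)
```

Framing the data as document-category matching is one possible design choice: it would let a trained model score category names it has not seen as fixed labels, which suits evaluation across tasks with different label sets.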

URL

https://arxiv.org/abs/2009.14335

PDF

https://arxiv.org/pdf/2009.14335.pdf