Abstract
Unsupervised text clustering is one of the major tasks in natural language processing (NLP) and remains a difficult and complex problem. Conventional methods generally treat this task in separate steps: learning a text representation, then clustering the representations. As an improvement, neural methods have been introduced for continuous representation learning to address the sparsity problem. However, the multi-step process still deviates from a unified optimization target. In particular, the second step, clustering, is generally performed with conventional methods such as k-Means. We propose a purely neural framework for end-to-end text clustering that jointly learns the text representation and the clustering model. Our model works well when the context can be obtained, which is nearly always the case in NLP. We evaluate our method on two widely used benchmarks: IMDB movie reviews for sentiment classification and 20 Newsgroups for topic categorization. Despite its simplicity, experiments show that the model outperforms previous clustering methods by a large margin. Furthermore, the model is also verified on an English wiki dataset as a large corpus.
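For orientation, the conventional two-step pipeline that the abstract contrasts against can be sketched as follows. This is a minimal illustration only, not the proposed model or any baseline from the paper: it assumes a TF-IDF bag-of-words representation for step one and plain k-Means for step two, and the function names (`tfidf_vectors`, `kmeans`, `sq_dist`), the farthest-point initialisation, and the toy corpus are all hypothetical choices made for this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Step 1 (assumed representation): L2-normalised TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            idf = math.log((1 + n) / (1 + df[tok])) + 1.0  # smoothed IDF
            vec[index[tok]] = count * idf
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Step 2: k-Means on the fixed vectors from step 1.

    The representation is never updated here, which is the disconnect
    between the two steps that the abstract points out.
    """
    # Deterministic farthest-point initialisation (a choice made for this sketch).
    centroids = [vectors[0][:]]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(far[:])
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus, invented for illustration only.
docs = [
    "the movie was great great acting",
    "great movie and great acting",
    "the graphics card driver crashed",
    "driver update fixed the graphics card",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

On this toy corpus the two review-like sentences fall into one cluster and the two hardware-related sentences into the other. The point of the sketch is structural: the clustering objective in `kmeans` has no way to influence the vectors produced by `tfidf_vectors`, whereas an end-to-end approach optimizes both together.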
URL
https://arxiv.org/abs/1903.09424