An end-to-end Neural Network Framework for Text Clustering

2019-03-22 09:54:36
Jie Zhou, Xingyi Cheng, Jinchao Zhang

Abstract

Unsupervised text clustering is one of the major tasks in natural language processing (NLP) and remains a difficult and complex problem. Conventional methods generally treat this task as separate steps: learning text representations and then clustering those representations. As an improvement, neural methods have been introduced for continuous representation learning to address the sparsity problem. However, the multi-step process still deviates from a unified optimization target; in particular, the second step, clustering, is generally performed with conventional methods such as k-Means. We propose a purely neural framework for text clustering in an end-to-end manner. It jointly learns the text representation and the clustering model. Our model works well when context can be obtained, which is nearly always the case in NLP. We evaluate our method on two widely used benchmarks: IMDB movie reviews for sentiment classification and 20-Newsgroup for topic categorization. Despite its simplicity, experiments show that the model outperforms previous clustering methods by a large margin. Furthermore, the model is also verified on an English wiki dataset as a large corpus.
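
For orientation, the conventional two-step pipeline that the abstract contrasts against (representation first, then k-Means on the fixed representations) can be sketched roughly as follows. This is not the authors' end-to-end model, which the abstract does not specify in detail. The function names `tfidf_vectors` and `kmeans`, the smoothed TF-IDF weighting, the toy documents, and the fixed `init_indices` used to keep the run deterministic are all illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Step 1 (assumed representation): sparse TF-IDF vectors, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # Smoothed idf; one common variant, chosen here for illustration.
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def kmeans(vectors, k, init_indices, iterations=20):
    """Step 2: plain k-Means (Lloyd's algorithm) on the fixed representations."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update: each centroid becomes the mean of its current members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two positive and two negative movie reviews.
docs = [
    "great movie wonderful acting great plot",
    "wonderful movie great acting",
    "terrible film boring plot awful",
    "awful boring film terrible acting",
]
labels = kmeans(tfidf_vectors(docs), k=2, init_indices=[0, 2])
print(labels)
```

On this toy input the two positive reviews share one cluster and the two negative reviews share the other. The point of the sketch is that the representation in step 1 is never updated by the clustering objective in step 2, which is the separation the proposed end-to-end framework is meant to remove.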

URL

https://arxiv.org/abs/1903.09424

PDF

https://arxiv.org/pdf/1903.09424.pdf

