Abstract
In text mining, topic models are a class of probabilistic generative models for inferring latent semantic topics from a text corpus. One of the most popular inference approaches for topic models is collapsed Gibbs sampling (CGS), which typically samples a single topic label for each observed document-word pair. In this paper, we aim to improve CGS inference for topic models. We propose to leverage the state augmentation technique by taking the number of topic samples to infinity, and then develop a new inference approach, called infinite latent state replication (ILR), that generates a robust soft topic assignment for each given document-word pair. Experimental results on publicly available datasets show that ILR outperforms CGS for inference in established topic models.
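The contrast between a single sampled topic label and a soft topic assignment can be illustrated with a minimal sketch. This is not the paper's ILR algorithm, whose details the abstract does not give. It only assumes an LDA-style model with symmetric priors and uses the standard collapsed conditional for one token. All function names, counts, and hyperparameter values below are hypothetical and chosen for illustration.

```python
import random


def topic_conditional(n_dk, n_kw, n_k, alpha, beta, vocab_size):
    """Standard collapsed LDA conditional p(z = k | rest) for one token.

    n_dk[k]: tokens in this document assigned to topic k (current token excluded)
    n_kw[k]: occurrences of this word assigned to topic k (current token excluded)
    n_k[k]:  all tokens assigned to topic k (current token excluded)
    """
    weights = [
        (n_dk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + vocab_size * beta)
        for k in range(len(n_dk))
    ]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assignment(probs, rng):
    """CGS-style step: draw exactly one topic label from the conditional."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def replicated_assignment(probs, n_samples, rng):
    """Draw n_samples labels and return the empirical topic proportions.

    Illustrative assumption: as n_samples grows, these proportions approach
    the conditional itself, which is one way to read a "soft" assignment.
    """
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return [draws.count(k) / n_samples for k in range(len(probs))]


# Hypothetical counts for one document-word pair with three topics.
rng = random.Random(0)
probs = topic_conditional(
    n_dk=[3, 2, 1], n_kw=[2, 3, 0], n_k=[40, 30, 30],
    alpha=0.1, beta=0.01, vocab_size=1000,
)
hard = hard_assignment(probs, rng)               # one integer label
soft = replicated_assignment(probs, 20000, rng)  # vector of proportions
```

With one draw, the token contributes a single integer label to the count tables. With many draws, the empirical proportions settle near `probs`, so the token's contribution becomes a fractional vector over topics rather than a single label.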
URL
https://arxiv.org/abs/2301.12974