Abstract
Probabilistic topic models are a powerful tool for extracting latent themes from large text datasets. In many text datasets, we also observe per-document covariates (e.g., source, style, political affiliation) that act as environments modulating a "global" (environment-agnostic) topic representation. Accurately learning these representations is important for prediction on new documents in unseen environments and for estimating the causal effect of topics on real-world outcomes. To this end, we introduce the Multi-environment Topic Model (MTM), an unsupervised probabilistic model that separates global and environment-specific terms. Through experiments on varied political content, from ads to tweets and speeches, we show that the MTM produces interpretable global topics with distinct environment-specific words. On multi-environment data, the MTM outperforms strong baselines both in- and out-of-distribution. It also enables accurate estimation of causal effects.
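To make the idea of separating global and environment-specific terms concrete, here is a minimal, hypothetical generative sketch. This is not the MTM's actual parameterization (the paper specifies that); it only illustrates, under assumed dimensions and an assumed additive decomposition in log space, how an environment-specific term could modulate a shared topic-word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): vocabulary, topics, environments.
V, K, E = 1000, 10, 3
doc_len, n_docs = 50, 5

# Global (environment-agnostic) topic-word log-weights, shared across environments.
global_logits = rng.normal(size=(K, V))

# Environment-specific deviations that modulate each topic's word distribution.
env_logits = rng.normal(scale=0.5, size=(E, K, V))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def generate_doc(env, alpha=0.1):
    """Sample one bag-of-words document from environment `env`."""
    theta = rng.dirichlet(alpha * np.ones(K))         # per-document topic proportions
    beta = softmax(global_logits + env_logits[env])   # env-modulated topic-word dists
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)                    # topic assignment for this word
        words.append(rng.choice(V, p=beta[z]))        # word drawn from the modulated topic
    return np.array(words)

docs = [generate_doc(env=rng.integers(E)) for _ in range(n_docs)]
print(docs[0][:10])
```

In a sketch like this, the global logits capture the environment-agnostic topic representation, while the environment-specific logits absorb source- or style-dependent vocabulary; a model that recovers the two components separately can transfer the global part to unseen environments.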
URL
https://arxiv.org/abs/2410.24126