Abstract
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
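As a rough illustration of the loop the abstract describes (sample text, synthesize it into concepts with explicit inclusion criteria, then repeat over the results to reach more general concepts), here is a minimal Python sketch. It is not the LLooM implementation or its API: the names `Concept`, `induce_concepts`, and `toy_synthesize` are hypothetical, the round structure is an assumption, and the word-overlap stub only stands in for the large language model call so that the example runs on its own.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Concept:
    name: str      # short human-readable label
    criteria: str  # explicit inclusion criterion, phrased as a yes/no question


# Hypothetical interface: a synthesizer maps a batch of texts to proposed concepts.
# In a real system this would be a prompt sent to a large language model.
Synthesizer = Callable[[List[str]], List[Concept]]


def induce_concepts(
    texts: List[str],
    synthesize: Synthesizer,
    rounds: int = 2,
    sample_size: int = 4,
    seed: int = 0,
) -> List[Concept]:
    """Assumed loop: sample items, synthesize concepts, then feed the concept
    names back in so that later rounds can propose more general concepts."""
    rng = random.Random(seed)
    items = list(texts)
    concepts: List[Concept] = []
    for _ in range(rounds):
        if not items:
            break
        batch = rng.sample(items, min(sample_size, len(items)))
        proposed = synthesize(batch)
        if not proposed:
            break  # nothing more general was found; keep the previous round
        concepts = proposed
        items = [c.name for c in concepts]
    return concepts


def toy_synthesize(batch: List[str]) -> List[Concept]:
    """Stand-in for the LLM call (an assumption for runnability only):
    proposes one concept per word shared by at least two texts."""
    counts = {}
    for text in batch:
        for word in set(text.lower().split()):
            counts[word] = counts.get(word, 0) + 1
    shared = sorted(w for w, n in counts.items() if n >= 2)
    return [Concept(name=w, criteria=f"Does the text mention '{w}'?") for w in shared]


if __name__ == "__main__":
    docs = ["women lack power", "power corrupts", "women deserve respect"]
    for concept in induce_concepts(docs, toy_synthesize, rounds=2, sample_size=3):
        print(concept.name, "|", concept.criteria)
```

The stub deliberately produces keyword-level output, which is exactly what the abstract argues against; replacing `toy_synthesize` with a model call that returns a descriptive name and a criterion for each concept is what would turn this skeleton into something closer to concept induction.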
URL
https://arxiv.org/abs/2404.12259