Paper Reading AI Learner

Context-Aware Clustering using Large Language Models

2024-05-02 03:50:31
Sindhu Tipirneni, Ravinarayana Adkathimar, Nurendra Choudhary, Gaurush Hiranandani, Rana Ali Amjad, Vassilis N. Ioannidis, Changhe Yuan, Chandan K. Reddy

Abstract

Despite the remarkable success of Large Language Models (LLMs) in text understanding and generation, their potential for text clustering tasks remains underexplored. We observed that powerful closed-source LLMs provide good quality clusterings of entity sets but are not scalable due to the massive compute power required and the associated costs. Thus, we propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS), a systematic approach that leverages open-source LLMs for efficient and effective supervised clustering of entity subsets, particularly focusing on text-based entities. Existing text clustering methods fail to effectively capture the context provided by the entity subset. Moreover, though there are several language modeling based approaches for clustering, very few are designed for the task of supervised clustering. This paper introduces a novel approach towards clustering entity subsets using LLMs by capturing context via a scalable inter-entity attention mechanism. We propose a novel augmented triplet loss function tailored for supervised clustering, which addresses the inherent challenges of directly applying the triplet loss to this problem. Furthermore, we introduce a self-supervised clustering task based on text augmentation techniques to improve the generalization of our model. For evaluation, we collect ground truth clusterings from a closed-source LLM and transfer this knowledge to an open-source LLM under the supervised clustering framework, allowing a faster and cheaper open-source model to perform the same task. Experiments on various e-commerce query and product clustering datasets demonstrate that our proposed approach significantly outperforms existing unsupervised and supervised baselines under various external clustering evaluation metrics.
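The abstract does not spell out the exact form of the augmented triplet loss, so the snippet below is only a rough, illustrative sketch of the standard supervised triplet objective that such a method would build on: entity embeddings are pulled together when they share a ground-truth cluster and pushed apart otherwise. The function name, margin value, and hard-mining choice are assumptions for illustration, not the authors' implementation, and the paper's augmentation and inter-entity attention are not reproduced here.

```python
# Illustrative sketch only (assumed details, not CACTUS itself): a plain
# triplet margin loss over entity embeddings, with positives drawn from the
# same ground-truth cluster and negatives from different clusters.
import torch
import torch.nn.functional as F

def triplet_clustering_loss(embeddings, cluster_ids, margin=0.5):
    """embeddings: (N, d) entity embeddings; cluster_ids: (N,) ground-truth cluster labels."""
    dist = torch.cdist(embeddings, embeddings)                      # pairwise Euclidean distances
    same = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)     # (N, N) same-cluster mask
    eye = torch.eye(len(cluster_ids), dtype=torch.bool, device=embeddings.device)
    pos_mask = same & ~eye                                          # same cluster, excluding self
    neg_mask = ~same                                                # different cluster
    losses = []
    for a in range(len(cluster_ids)):
        if pos_mask[a].any() and neg_mask[a].any():
            hardest_pos = dist[a][pos_mask[a]].max()                # farthest same-cluster entity
            hardest_neg = dist[a][neg_mask[a]].min()                # closest other-cluster entity
            losses.append(F.relu(hardest_pos - hardest_neg + margin))
    return torch.stack(losses).mean() if losses else embeddings.new_zeros(())

# Toy usage: 6 entities in 2 ground-truth clusters
emb = torch.randn(6, 16, requires_grad=True)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
loss = triplet_clustering_loss(emb, labels)
loss.backward()
```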

URL

https://arxiv.org/abs/2405.00988

PDF

https://arxiv.org/pdf/2405.00988.pdf

