Paper Reading AI Learner

Context-Aware Clustering using Large Language Models

2024-05-02 03:50:31
Sindhu Tipirneni, Ravinarayana Adkathimar, Nurendra Choudhary, Gaurush Hiranandani, Rana Ali Amjad, Vassilis N. Ioannidis, Changhe Yuan, Chandan K. Reddy

Abstract

Despite the remarkable success of Large Language Models (LLMs) in text understanding and generation, their potential for text clustering tasks remains underexplored. We observed that powerful closed-source LLMs provide good quality clusterings of entity sets but are not scalable due to the massive compute power required and the associated costs. Thus, we propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS), a systematic approach that leverages open-source LLMs for efficient and effective supervised clustering of entity subsets, particularly focusing on text-based entities. Existing text clustering methods fail to effectively capture the context provided by the entity subset. Moreover, though there are several language modeling based approaches for clustering, very few are designed for the task of supervised clustering. This paper introduces a novel approach towards clustering entity subsets using LLMs by capturing context via a scalable inter-entity attention mechanism. We propose a novel augmented triplet loss function tailored for supervised clustering, which addresses the inherent challenges of directly applying the triplet loss to this problem. Furthermore, we introduce a self-supervised clustering task based on text augmentation techniques to improve the generalization of our model. For evaluation, we collect ground truth clusterings from a closed-source LLM and transfer this knowledge to an open-source LLM under the supervised clustering framework, allowing a faster and cheaper open-source model to perform the same task. Experiments on various e-commerce query and product clustering datasets demonstrate that our proposed approach significantly outperforms existing unsupervised and supervised baselines under various external clustering evaluation metrics.
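The abstract does not spell out the exact form of the augmented triplet loss, so the snippet below is only a rough, illustrative sketch of the standard supervised triplet objective that such a method would build on: entity embeddings are pulled together when they share a ground-truth cluster and pushed apart otherwise. The function name, margin value, and hard-mining choice are assumptions for illustration, not the authors' implementation, and the paper's augmentation and inter-entity attention are not reproduced here.

```python
# Illustrative sketch only (assumed details, not CACTUS itself): a plain
# triplet margin loss over entity embeddings, with positives drawn from the
# same ground-truth cluster and negatives from different clusters.
import torch
import torch.nn.functional as F

def triplet_clustering_loss(embeddings, cluster_ids, margin=0.5):
    """embeddings: (N, d) entity embeddings; cluster_ids: (N,) ground-truth cluster labels."""
    dist = torch.cdist(embeddings, embeddings)                      # pairwise Euclidean distances
    same = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)     # (N, N) same-cluster mask
    eye = torch.eye(len(cluster_ids), dtype=torch.bool, device=embeddings.device)
    pos_mask = same & ~eye                                          # same cluster, excluding self
    neg_mask = ~same                                                # different cluster
    losses = []
    for a in range(len(cluster_ids)):
        if pos_mask[a].any() and neg_mask[a].any():
            hardest_pos = dist[a][pos_mask[a]].max()                # farthest same-cluster entity
            hardest_neg = dist[a][neg_mask[a]].min()                # closest other-cluster entity
            losses.append(F.relu(hardest_pos - hardest_neg + margin))
    return torch.stack(losses).mean() if losses else embeddings.new_zeros(())

# Toy usage: 6 entities in 2 ground-truth clusters
emb = torch.randn(6, 16, requires_grad=True)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
loss = triplet_clustering_loss(emb, labels)
loss.backward()
```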

URL

https://arxiv.org/abs/2405.00988

PDF

https://arxiv.org/pdf/2405.00988.pdf

