Abstract
Embedding models are crucial for many natural language processing tasks but can be hampered by limited vocabulary, missing context, and grammatical errors in the input text. This paper proposes a novel approach to improving embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding step. By using ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counterfactual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of domain-specific characteristics. The findings suggest that LLM-based text enrichment is a promising way to improve embedding performance, at least in certain domains, and can sidestep several inherent limitations of the embedding process.
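The enrich-then-embed pipeline the abstract describes can be sketched as follows. This is an illustrative outline, not the authors' code: the function names are invented, and the LLM call (in the paper, ChatGPT 3.5) and the embedding model are injected as callables so the sketch runs without API access.

```python
# Sketch of LLM-based text enrichment before embedding (illustrative
# names; `rewrite` stands in for a chat-completion call such as
# ChatGPT 3.5, and `embed` for any embedding model).

def enrich(text: str, rewrite) -> str:
    """Ask an LLM (via `rewrite`) to correct and expand `text`
    before it is embedded."""
    prompt = (
        "Rewrite the following text: fix grammar, expand abbreviations, "
        "and add brief clarifying context.\n\n" + text
    )
    return rewrite(prompt)


def embed_enriched(texts, rewrite, embed):
    """Enrich each input with the LLM, then embed the rewritten text
    instead of the raw original."""
    return [embed(enrich(t, rewrite)) for t in texts]
```

With a real LLM client plugged in as `rewrite` and, say, an MTEB-style embedding model as `embed`, the only change versus a standard pipeline is that the embedder sees the rewritten text rather than the raw input.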
URL
https://arxiv.org/abs/2404.12283