Paper Reading AI Learner

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

2024-04-18 15:58:56
Nicholas Harris, Anand Butani, Syed Hashmy

Abstract

Embedding models are crucial for many natural language processing tasks, but their performance can be limited by factors such as restricted vocabulary, missing context, and grammatical errors in the input. This paper proposes a novel approach to improving embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding step. Using ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and AmazonCounterfactualClassification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34, compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. Performance on the other two datasets was less impressive, however, highlighting the importance of domain-specific characteristics. The findings suggest that LLM-based text enrichment is a promising way to improve embedding performance in certain domains and to sidestep several inherent limitations of the embedding process.
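The pipeline the abstract describes, rewriting input text with an LLM before embedding it, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` and `embed` are hypothetical stand-ins for a ChatGPT 3.5 call and an embedding model, stubbed here so the example is self-contained, and the enrichment prompt is an assumed example rather than one of the paper's evaluated prompts.

```python
# Sketch of the proposed pipeline: enrich/rewrite text with an LLM,
# then embed the rewritten text instead of the raw input.

ENRICHMENT_PROMPT = (
    "Rewrite the following text: expand abbreviations, fix grammatical "
    "errors, and add clarifying context. Return only the rewritten text.\n\n"
)

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a ChatGPT 3.5 API call. This stub just
    # simulates one of the fixes the paper targets (error correction).
    text = prompt.split("\n\n", 1)[1]
    return text.replace("cant", "cannot")

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for an embedding model. A toy
    # bag-of-letters vector suffices to show where embedding happens.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def enriched_embedding(raw_text: str) -> list[float]:
    # The core idea: rewrite first, then embed the improved text.
    rewritten = call_llm(ENRICHMENT_PROMPT + raw_text)
    return embed(rewritten)
```

In a real system the baseline would be `embed(raw_text)` and the proposed method `enriched_embedding(raw_text)`, which is the comparison the paper evaluates on the three MTEB tasks.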

URL

https://arxiv.org/abs/2404.12283

PDF

https://arxiv.org/pdf/2404.12283.pdf

