Paper Reading AI Learner

SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost

2024-10-03 17:58:29
Jifan Zhang, Robert Nowak

Abstract

Creating specialized large language models requires vast amounts of clean, special-purpose data for training and fine-tuning. With only a handful of existing large-scale, domain-specific datasets, most applications require building new datasets, which in turn requires new, application-specific methods for filtering web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but it is extremely expensive at web scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost: SIEVE can perform up to 500 filtering operations for the cost of a single GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, the T5 filter performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high-quality and domain-specific content. The results demonstrate that our method curates large, high-quality datasets for language model training at roughly 1% of the cost of existing techniques. Further experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
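
The abstract only sketches how the GPT-4o/T5 integration works. Below is a minimal, hypothetical illustration of a background active-learning loop of the kind described: a cheap classifier is trained on labels obtained from a small number of expensive oracle calls, with uncertainty sampling deciding which documents are worth sending to the expensive model. The `gpt4o_label()` oracle is a toy placeholder for a real GPT-4o filtering call, and a TF-IDF + logistic-regression model stands in for the fine-tuned T5 filter; none of these names or parameters come from the paper.

```python
# Illustrative sketch only: uncertainty-based active learning for data filtering.
# gpt4o_label() is a hypothetical placeholder for a GPT-4o filtering call, and
# TF-IDF + logistic regression stands in for the fine-tuned T5 model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def gpt4o_label(text: str) -> int:
    """Toy oracle: 1 = keep the document, 0 = filter it out."""
    return int("physics" in text.lower())  # stand-in for an expensive LLM call


def active_filter(corpus, seed_size=4, rounds=3, batch_size=2):
    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)
    rng = np.random.default_rng(0)

    # Label a small random seed set with the expensive oracle.
    labels = {int(i): gpt4o_label(corpus[i])
              for i in rng.choice(len(corpus), size=seed_size, replace=False)}
    # Make sure both classes are represented before fitting the classifier.
    while len(set(labels.values())) < 2 and len(labels) < len(corpus):
        j = int(rng.integers(len(corpus)))
        labels[j] = gpt4o_label(corpus[j])

    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        idx = list(labels)
        clf.fit(X[idx], [labels[i] for i in idx])
        # Query the oracle only on documents the cheap model is least sure about.
        probs = clf.predict_proba(X)[:, 1]
        unlabeled = [i for i in range(len(corpus)) if i not in labels]
        if not unlabeled:
            break
        for i in sorted(unlabeled, key=lambda i: abs(probs[i] - 0.5))[:batch_size]:
            labels[i] = gpt4o_label(corpus[i])

    # Once trained, the lightweight model filters the whole corpus by itself.
    idx = list(labels)
    clf.fit(X[idx], [labels[i] for i in idx])
    return clf.predict(X)


if __name__ == "__main__":
    docs = ["intro to quantum physics", "celebrity gossip roundup",
            "lecture notes on particle physics", "weekend travel deals",
            "physics of semiconductors", "top ten recipes"]
    print(active_filter(docs))  # 1 = keep, 0 = discard
```

In the actual system the cheap model is a T5 checkpoint fine-tuned in the background, and the query strategy and stopping rule are part of the paper's contribution; the sketch only conveys the overall economics, where a handful of oracle calls train a filter that is then applied to the entire corpus on its own.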


URL

https://arxiv.org/abs/2410.02755

PDF

https://arxiv.org/pdf/2410.02755.pdf

