Rho-1: Not All Tokens Are What You Need

2024-04-11 17:52:01
Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen

Abstract

Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that are aligned with the desired distribution. This approach scores pretraining tokens with a reference model and then trains the language model with a focused loss on the tokens with higher excess loss. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretrained on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.
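Based only on the abstract's description of Selective Language Modeling, a minimal PyTorch sketch of the token-selective loss might look as follows. The function name selective_lm_loss, the keep_ratio parameter, and the use of per-token cross-entropy against a frozen reference model are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(model_logits, ref_logits, labels, keep_ratio=0.6):
    """Sketch of Selective Language Modeling (SLM) as described in the abstract:
    score each token by its excess loss (training-model loss minus reference-model
    loss) and back-propagate only through the highest-scoring fraction of tokens.

    model_logits, ref_logits: (batch, seq_len, vocab)
    labels:                   (batch, seq_len) next-token targets, -100 = ignore
    keep_ratio:               fraction of tokens to keep (hypothetical value)
    """
    vocab = model_logits.size(-1)
    flat_labels = labels.view(-1)

    # Per-token cross-entropy for the model being trained and for the reference model.
    model_loss = F.cross_entropy(
        model_logits.view(-1, vocab), flat_labels,
        reduction="none", ignore_index=-100,
    )
    with torch.no_grad():
        ref_loss = F.cross_entropy(
            ref_logits.view(-1, vocab), flat_labels,
            reduction="none", ignore_index=-100,
        )
        # Excess loss: tokens the current model finds hard but the reference model
        # (trained on the desired distribution) finds easy score highest.
        excess = model_loss.detach() - ref_loss

    # Keep only the top keep_ratio fraction of valid tokens by excess loss.
    valid = flat_labels != -100
    k = max(1, int(keep_ratio * valid.sum().item()))
    scores = excess.masked_fill(~valid, float("-inf"))
    topk_idx = scores.topk(k).indices

    # Average the training loss over the selected tokens only.
    return model_loss[topk_idx].mean()
```

In this reading, the reference model trained on the desired distribution acts purely as a scorer: tokens where the training model lags furthest behind it receive gradient, while the remaining tokens contribute nothing to the update.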

URL

https://arxiv.org/abs/2404.07965

PDF

https://arxiv.org/pdf/2404.07965.pdf

