Paper Reading AI Learner

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

2024-05-07 15:56:43
DeepSeek-AI

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times that of DeepSeek 67B. We pretrain DeepSeek-V2 on a high-quality, multi-source corpus of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models. The model checkpoints are available at this https URL.
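The core idea behind MLA's KV-cache saving can be sketched in a few lines: instead of caching full per-head keys and values, cache one low-rank latent vector per token and reconstruct K and V by up-projection at attention time. This is a minimal illustration only; the dimensions, projection matrices (`W_down`, `W_uk`, `W_uv`), and layout below are hypothetical, not DeepSeek-V2's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d_latent << n_heads * d_head is what yields the saving.
d_model, n_heads, d_head, d_latent = 512, 8, 64, 64

W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # hidden -> latent
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # latent -> keys
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # latent -> values

seq_len = 16
h = rng.standard_normal((seq_len, d_model))  # hidden states of cached tokens

# Cache only the latent: seq_len * d_latent floats, versus
# seq_len * 2 * n_heads * d_head for a conventional per-head KV cache.
latent_cache = h @ W_down

full_kv_floats = seq_len * 2 * n_heads * d_head
latent_floats = seq_len * d_latent
print(f"latent cache / full KV cache: {latent_floats / full_kv_floats:.4f}")

# At decode time, reconstruct per-head K and V from the latent cache.
K = (latent_cache @ W_uk).reshape(seq_len, n_heads, d_head)
V = (latent_cache @ W_uv).reshape(seq_len, n_heads, d_head)
print(K.shape, V.shape)
```

With these toy dimensions the latent cache holds 1/16 of the floats of a full KV cache, which mirrors (but does not reproduce) the paper's reported 93.3% KV-cache reduction.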

URL

https://arxiv.org/abs/2405.04434

PDF

https://arxiv.org/pdf/2405.04434.pdf
