Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
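The central idea the abstract relies on is that a sparse Mixture-of-Experts model holds many total parameters but, via a learned router, activates only a small subset of them for each input token, which is how OLMoE-1B-7B can have 7B parameters yet use about 1B per token. Below is a minimal, illustrative sketch of a top-k MoE layer in PyTorch; the hidden size, expert count, and top-k value are assumptions chosen for illustration, not the actual OLMoE-1B-7B configuration.

```python
# Minimal sketch of a sparse Mixture-of-Experts (MoE) layer with top-k routing.
# Illustrative only: sizes and expert counts below are assumptions, not OLMoE's exact config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)             # (n_tokens, n_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1) # keep only top-k experts per token
        out = torch.zeros_like(x)
        # Only the selected experts run for each token; the rest stay idle,
        # so "active" parameters per token are a small fraction of total parameters.
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    weight = topk_probs[mask, slot].unsqueeze(-1)
                    out[mask] += weight * self.experts[e](x[mask])
        return out

if __name__ == "__main__":
    layer = SparseMoELayer()
    tokens = torch.randn(16, 1024)    # 16 tokens with hidden size 1024
    print(layer(tokens).shape)        # torch.Size([16, 1024])
```

The per-token compute scales with the top-k experts actually invoked, while total capacity scales with the full expert count, which is the total-vs-active parameter distinction the abstract describes.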
URL
https://arxiv.org/abs/2409.02060