Abstract
The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only a few-shot chain-of-thought prompt is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B). We propose model specialization, which concentrates a model's ability on a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power but spread it across a broad spectrum of tasks, whereas small models (commonly viewed as smaller than 10B) have limited capacity, yet if we concentrate that capacity on a specific target task, the model can achieve a decent performance improvement. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model abilities: (1) there exists a very complex balance/tradeoff between language models' multi-dimensional abilities; (2) by paying the price of decreased generic ability, we can clearly lift the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability. We further give a comprehensive discussion of important design choices for better generalization, including the tuning data format, the starting model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.
URL
https://arxiv.org/abs/2301.12726