Abstract
As the capabilities of Large Language Models (LLMs) in healthcare and medicine continue to advance, there is a growing need for competitive open-source models that can safeguard public interest. With the increasing availability of highly competitive open base models, the impact of continued pre-training is increasingly uncertain. In this work, we explore the role of instruct tuning, model merging, alignment, red teaming, and advanced inference schemes as means to improve current open models. To that end, we introduce the Aloe family, a set of open medical LLMs highly competitive within their scale range. Aloe models are trained on the current best base models (Mistral, LLaMA 3), using a new custom dataset that combines public data sources improved with synthetic Chain of Thought (CoT). Aloe models undergo an alignment phase, becoming some of the first policy-aligned open healthcare LLMs to use Direct Preference Optimization, setting a new standard for ethical performance in healthcare LLMs. Model evaluation expands to include various bias and toxicity datasets, a dedicated red-teaming effort, and a much-needed risk assessment for healthcare LLMs. Finally, to explore the limits of current LLMs in inference, we study several advanced prompt engineering strategies to boost performance across benchmarks, yielding state-of-the-art results for open healthcare 7B LLMs, unprecedented at this scale.
URL
https://arxiv.org/abs/2405.01886