Abstract
In recent years, the widespread adoption of Large Language Models (LLMs) has sparked interest in their potential application within the military domain. However, the current generation of LLMs demonstrates sub-optimal performance on Army use cases because of the prevalence of domain-specific vocabulary and jargon. To fully leverage LLMs in-domain, many organizations have turned to fine-tuning, which avoids the prohibitive cost of training new LLMs from scratch. In light of this trend, we explore the viability of adapting open-source LLMs for use in the Army domain to address their lack of domain specificity. Our investigations have produced three distinct generations of TRACLM, a family of LLMs fine-tuned by The Research and Analysis Center (TRAC), Army Futures Command (AFC). Through continuous refinement of our training pipeline, each successive iteration of TRACLM displayed improved capabilities when applied to Army tasks and use cases. Throughout our fine-tuning experiments, we also recognized the need for an evaluation framework that objectively quantifies the Army domain-specific knowledge of LLMs. To address this, we developed MilBench, an extensible software framework that efficiently evaluates the Army knowledge of a given LLM using tasks derived from doctrine and assessments. We share preliminary results, models, methods, and recommendations on the creation of TRACLM and MilBench. Our work informs the development of LLM technology across the DoD and supports senior leader decisions on artificial intelligence integration.
URL
https://arxiv.org/abs/2410.20297