
Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain

2024-10-27 00:39:24
Daniel C. Ruiz, John Sell

Abstract

In recent years, the widespread adoption of Large Language Models (LLMs) has sparked interest in their potential for application within the military domain. However, the current generation of LLMs demonstrates sub-optimal performance on Army use cases due to the prevalence of domain-specific vocabulary and jargon. To fully leverage LLMs in-domain, many organizations have turned to fine-tuning to circumvent the prohibitive costs of training new LLMs from scratch. In light of this trend, we explore the viability of adapting open-source LLMs for use in the Army domain to address their existing lack of domain specificity. Our investigations have resulted in the creation of three distinct generations of TRACLM, a family of LLMs fine-tuned by The Research and Analysis Center (TRAC), Army Futures Command (AFC). Through continuous refinement of our training pipeline, each successive iteration of TRACLM displayed improved capabilities when applied to Army tasks and use cases. Furthermore, throughout our fine-tuning experiments, we recognized the need for an evaluation framework that objectively quantifies the Army domain-specific knowledge of LLMs. To address this, we developed MilBench, an extensible software framework that efficiently evaluates the Army knowledge of a given LLM using tasks derived from doctrine and assessments. We share preliminary results, models, methods, and recommendations on the creation of TRACLM and MilBench. Our work significantly informs the development of LLM technology across the DoD and augments senior leader decisions with respect to artificial intelligence integration.
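
The abstract describes two threads of work: fine-tuning open-source LLMs into the TRACLM family, and measuring Army domain knowledge with MilBench. As a rough illustration of the evaluation thread, the sketch below grades a causal LLM on multiple-choice items by comparing sequence likelihoods. It is a minimal sketch under stated assumptions: the model name, the sample item, and the likelihood-based scoring rule are placeholders chosen for illustration and are not MilBench's actual tasks, API, or code.

```python
# Minimal sketch of a doctrine-style multiple-choice evaluation loop.
# Illustrative only: the model, the sample item, and the scoring rule are
# assumptions; they are not drawn from the paper or from MilBench itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute any open-source LLM under evaluation

# Hypothetical doctrine-derived item; real MilBench tasks come from Army
# doctrine and assessments, which are not reproduced here.
ITEMS = [
    {
        "question": "Which warfighting function includes transportation and supply?",
        "choices": ["A. Fires", "B. Sustainment", "C. Protection", "D. Intelligence"],
        "answer": "B",
    },
]

def score_choice(model, tokenizer, prompt, choice):
    """Return the negative average token loss over prompt + choice (a likelihood proxy)."""
    enc = tokenizer(prompt + " " + choice, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return -out.loss.item()  # higher means the model finds the sequence more likely

def evaluate(model, tokenizer, items):
    """Pick the highest-scoring choice per item and report accuracy."""
    correct = 0
    for item in items:
        scores = [score_choice(model, tokenizer, item["question"], c) for c in item["choices"]]
        predicted_letter = item["choices"][scores.index(max(scores))][0]
        correct += int(predicted_letter == item["answer"])
    return correct / len(items)

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()
    print(f"Accuracy: {evaluate(model, tokenizer, ITEMS):.2%}")
```

Likelihood-based scoring of answer options is one common way to grade base models on multiple-choice tasks; a framework like MilBench could equally rely on generated-text matching, a detail the abstract does not specify.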


URL

https://arxiv.org/abs/2410.20297

PDF

https://arxiv.org/pdf/2410.20297.pdf

