Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain

2024-04-12 06:21:48

Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Tatsuya Ishigaki

arXiv_AI

Abstract

Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Furthermore, we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.

URL

https://arxiv.org/abs/2404.08262

PDF

https://arxiv.org/pdf/2404.08262.pdf
