Abstract
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and performs competitively with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Notably, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, code, and datasets are publicly released.
URL
https://arxiv.org/abs/2505.16661