A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

2025-05-22 13:27:37
Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki, Shinnosuke Ono

Abstract

We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and is competitive with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, code, and datasets are released at this https URL.

URL

https://arxiv.org/abs/2505.16661

PDF

https://arxiv.org/pdf/2505.16661.pdf

