Paper Reading AI Learner

Bioformer: an efficient transformer language model for biomedical text mining

2023-02-03 08:04:59
Li Fang, Qingyu Chen, Chih-Hsuan Wei, Zhiyong Lu, Kai Wang

Abstract

Pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art performance on natural language processing (NLP) tasks. Recently, BERT has been adapted to the biomedical domain. Despite their effectiveness, these models have hundreds of millions of parameters and are computationally expensive when applied to large-scale NLP applications. We hypothesized that the number of parameters of the original BERT can be dramatically reduced with only a minor impact on performance. In this study, we present Bioformer, a compact BERT model for biomedical text mining. We pretrained two Bioformer models (named Bioformer8L and Bioformer16L), which reduce the model size by 60% compared to BERT-Base. Bioformer uses a biomedical vocabulary and was pretrained from scratch on PubMed abstracts and PubMed Central full-text articles. We thoroughly evaluated the performance of Bioformer, as well as existing biomedical BERT models including BioBERT and PubMedBERT, on 15 benchmark datasets covering four different biomedical NLP tasks: named entity recognition, relation extraction, question answering and document classification. The results show that, with 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT, while Bioformer8L is 0.9% less accurate than PubMedBERT. Both Bioformer16L and Bioformer8L outperformed BioBERT-Base-v1.1. In addition, Bioformer16L and Bioformer8L are two to three times as fast as PubMedBERT and BioBERT-Base-v1.1. Bioformer has been successfully deployed to PubTator Central, providing gene annotations over 35 million PubMed abstracts and 5 million PubMed Central full-text articles. We make Bioformer publicly available via this https URL, including pretrained models, datasets, and instructions for downstream use.
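
The abstract states that the pretrained models are publicly released for downstream use. As a minimal sketch of what that looks like in practice, the snippet below loads a Bioformer checkpoint with the Hugging Face transformers library and queries it as a masked language model. The model identifier "bioformers/bioformer-8L" is an assumption about the published checkpoint path, not something stated in the abstract; substitute the actual ID from the authors' release if it differs.

# Minimal sketch: load a Bioformer checkpoint and run masked-token
# prediction with Hugging Face transformers. Assumes the checkpoint is
# published on the Hub as "bioformers/bioformer-8L" (hypothetical ID).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bioformers/bioformer-8L"  # assumed model ID; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Bioformer is BERT-based, so the mask token is expected to be [MASK].
text = "BRCA1 mutations increase the risk of [MASK] cancer."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and report the top-scoring token.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_id = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(top_id))

The 16-layer variant (Bioformer16L) would load the same way; per the abstract, it narrows the accuracy gap with PubMedBERT to 0.1% at some cost in speed relative to Bioformer8L.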

URL

https://arxiv.org/abs/2302.01588

PDF

https://arxiv.org/pdf/2302.01588.pdf

