emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

Abstract

Machine Reading Comprehension (MRC) plays a pivotal role in shaping Medical Question Answering Systems (QAS) and in transforming how medical information is accessed and applied. However, challenges inherent to the medical field, such as complex terminology and question ambiguity, call for innovative solutions. One key solution is to integrate specialized medical datasets and to create dedicated ones. This approach improves the accuracy of QAS and contributes to advances in clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated: a novel Span extraction dataset derived from emrQA and restructured into 163,695 questions and 4,136 manually obtained answers, named the emrQA-msquad dataset. Additionally, for ambiguous questions, a dedicated medical dataset for the Span extraction task was introduced, reinforcing the system's robustness. Fine-tuning BERT, RoBERTa, and Tiny RoBERTa for medical contexts significantly improved response accuracy: the share of responses within the F1-score range of 0.75 to 1.00 rose from 10.1% to 37.4%, from 18.7% to 44.7%, and from 16.0% to 46.8%, respectively. Finally, the emrQA-msquad dataset is publicly available at this https URL.
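
The abstract reports results as the share of responses whose F1-score falls between 0.75 and 1.00. The sketch below illustrates one way such a figure could be computed, assuming the token-overlap F1 commonly used for SQuAD-style span extraction; the abstract does not state the exact scoring procedure. The function names `token_f1` and `share_in_range` and the example answer strings are hypothetical and are not taken from the emrQA-msquad dataset.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted span and a reference span.

    Assumption: simple lowercasing and whitespace tokenization; the
    official SQuAD script also strips punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # SQuAD V2.0 convention: an unanswerable question has an empty answer,
    # so two empty spans count as a perfect match.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def share_in_range(scores, low=0.75, high=1.00):
    """Fraction of scores that fall inside the closed interval [low, high]."""
    if not scores:
        return 0.0
    return sum(low <= s <= high for s in scores) / len(scores)


# Hypothetical (prediction, reference) pairs for illustration only.
pairs = [
    ("metoprolol 25 mg", "metoprolol 25 mg"),  # exact match
    ("metoprolol", "metoprolol 25 mg"),        # partial match
    ("aspirin", "metoprolol 25 mg"),           # no overlap
    ("", ""),                                  # unanswerable, correctly empty
]
scores = [token_f1(p, r) for p, r in pairs]
share = share_in_range(scores)
print(f"Share of responses with F1 in [0.75, 1.00]: {share:.1%}")
```

Under this reading, a reported rise from 10.1% to 37.4% would mean that the fraction returned by `share_in_range` increased after fine-tuning, not that the mean F1-score itself took those values.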

Abstract (translated)

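Machine Reading Comprehension (MRC) plays a key role in shaping Medical Question Answering Systems (QAS) and the landscape of accessing and using medical information. However, challenges inherent to the medical field, such as complex terminology and ambiguous questions, require innovative solutions. One key solution involves integrating specialized medical datasets and creating dedicated datasets. This strategy improves the accuracy of QAS and promotes progress in clinical decision-making and medical research. To address the complexity of medical terminology, a specialized medical dataset was integrated, for example a novel span extraction dataset derived from emrQA but restructured into 163,695 questions and 4,136 manually obtained answers; this new dataset is called the emrQA-msquad dataset. In addition, to handle ambiguous questions, a medical dataset dedicated to the span extraction task was introduced, strengthening the system's robustness. Fine-tuning models such as BERT, RoBERTa, and Tiny RoBERTa improved response accuracy in medical contexts from 10.1% to 37.4%, from 18.7% to 44.7%, and from 16.0% to 46.8%, respectively. Finally, the emrQA-msquad dataset is publicly available at https://url.cn/Zhangqi_emrQA_msquad_dataset.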

URL

https://arxiv.org/abs/2404.12050

PDF

https://arxiv.org/pdf/2404.12050.pdf
