Abstract
User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.
URL
https://arxiv.org/abs/2403.18336