Abstract
Document-level Biomedical Relation Extraction (Bio-RE) aims to identify relations between biomedical entities across long texts and is a crucial task in biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations that span multiple sentences. Previous methods also tend to overlook the incompleteness of documents and do not integrate external knowledge, which limits contextual richness, and the scarcity of annotated data further hampers model training. Recent advances in large language models (LLMs) have inspired us to address all of these issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework based on LLM Adaptive Document-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, to address data scarcity, we introduce the Iteration-of-REsummary (IoRs) prompt, which generates Bio-RE task-specific synthetic data by guiding ChatGPT to focus on entity relations and iteratively refining its output. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model's contextual understanding and cross-sentence inference capabilities. Finally, at inference time, a biomedical-specific RAG approach named CUI RAG uses CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document context. Experiments on three Bio-RE datasets (GDA, CDR, and BioRED) show that the proposed method achieves state-of-the-art performance compared with related work.
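The idea behind the CUI RAG step, using CUIs as index keys so that retrieval only considers passages mentioning the queried entities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_cui_index`, `retrieve_by_cui`), the overlap-count ranking, and the toy passages with made-up CUI-style identifiers are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy knowledge passages, each tagged with the CUIs it mentions.
# The identifiers below are invented placeholders, not real UMLS CUIs.
PASSAGES = {
    "p1": ("Passage discussing gene A together with disease B.", {"C0000001", "C0000002"}),
    "p2": ("Passage discussing gene A only.", {"C0000001"}),
    "p3": ("Passage discussing an unrelated chemical.", {"C0000003"}),
}


def build_cui_index(passages):
    """Build an inverted index mapping each CUI to the ids of passages mentioning it."""
    index = defaultdict(set)
    for pid, (_text, cuis) in passages.items():
        for cui in cuis:
            index[cui].add(pid)
    return index


def retrieve_by_cui(index, passages, head_cui, tail_cui, k=2):
    """Return up to k passage texts that mention the head or tail entity.

    Only passages reachable through the two CUIs are considered, which is
    what narrows the retrieval scope. Ranking by how many of the two CUIs a
    passage mentions is an assumption of this sketch.
    """
    candidates = index.get(head_cui, set()) | index.get(tail_cui, set())

    def overlap(pid):
        cuis = passages[pid][1]
        return int(head_cui in cuis) + int(tail_cui in cuis)

    # Sort by descending overlap, then by id for a deterministic order.
    ranked = sorted(candidates, key=lambda pid: (-overlap(pid), pid))
    return [passages[pid][0] for pid in ranked[:k]]


INDEX = build_cui_index(PASSAGES)
CONTEXTS = retrieve_by_cui(INDEX, PASSAGES, "C0000001", "C0000002", k=2)
```

In a full pipeline the retrieved passages would be appended to the input document before the fine-tuned model predicts the relation; how the paper ranks and merges contexts is not stated in the abstract.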
URL
https://arxiv.org/abs/2501.05155