Document-level Relation Extraction (DocRE) involves identifying relations between entities across multiple sentences in a document. Evidence sentences, which are crucial for precisely identifying entity-pair relations, help focus the model on essential text segments and improve DocRE performance. However, existing evidence retrieval systems often overlook the collaborative nature of semantically similar entity pairs in the same document, hindering the effectiveness of the evidence retrieval task. To address this, we propose a novel evidence retrieval framework, namely CDER. CDER employs an attentional graph-based architecture to capture collaborative patterns and incorporates a dynamic sub-structure for additional robustness in evidence retrieval. Experimental results on the benchmark DocRE dataset show that CDER not only excels in the evidence retrieval task but also enhances the overall performance of existing DocRE systems.
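Below is a minimal PyTorch sketch of the collaborative idea described above, assuming entity-pair and sentence embeddings are already produced by some document encoder; it only illustrates attention among entity pairs followed by evidence scoring, not the actual CDER architecture.

```python
import torch
import torch.nn as nn

class CollaborativeEvidenceScorer(nn.Module):
    """Entity-pair representations attend to one another so that semantically
    similar pairs share evidence signals; each enriched pair then scores every
    sentence in the document as potential evidence."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.pair_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sent_proj = nn.Linear(dim, dim)

    def forward(self, pair_emb: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # pair_emb: (num_pairs, dim), sent_emb: (num_sents, dim)
        collab, _ = self.pair_attn(pair_emb[None], pair_emb[None], pair_emb[None])
        collab = collab.squeeze(0)                  # pairs enriched by similar pairs
        return collab @ self.sent_proj(sent_emb).T  # (num_pairs, num_sents) evidence logits

pairs, sents = torch.randn(5, 256), torch.randn(12, 256)
print(CollaborativeEvidenceScorer()(pairs, sents).shape)  # torch.Size([5, 12])
```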
https://arxiv.org/abs/2504.06529
Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets. Task-incremental fine-tuning of LLMs demonstrates superior performance over earlier approaches using encoder-only models like BERT on TACRED, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), particularly with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.
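As an illustration of the memory-replay idea, here is a small, hedged sketch of a task-incremental training loop; `fine_tune_step` is a hypothetical placeholder for one LLM fine-tuning update, and the per-relation random exemplar sampling is an assumption, not the paper's exact strategy.

```python
import random

def continual_train_with_replay(tasks, fine_tune_step, memory_per_relation=5):
    """Task-incremental training with memory replay: after each task, keep a few
    exemplars per relation and mix them into every later task's training data to
    curb catastrophic forgetting. `fine_tune_step` stands in for one LLM update."""
    memory = []                                    # replayed exemplars
    for task in tasks:                             # each task: list of (text, relation)
        train_data = task + memory
        random.shuffle(train_data)
        for example in train_data:
            fine_tune_step(example)
        by_relation = {}
        for text, rel in task:
            by_relation.setdefault(rel, []).append((text, rel))
        for rel, examples in by_relation.items():  # sample exemplars of the new relations
            memory.extend(random.sample(examples, min(memory_per_relation, len(examples))))
    return memory

# toy usage with a no-op update step
tasks = [[("Acme was founded by Jo.", "founded_by")], [("Jo works for Acme.", "employee_of")]]
continual_train_with_replay(tasks, fine_tune_step=lambda example: None)
```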
https://arxiv.org/abs/2504.05214
Objective: Zero-shot methodology promises to cut down on the costs of dataset annotation and domain expertise needed to make use of NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. As of yet, it is unclear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI GPT-4-turbo and their reasoning model o1 to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to generate structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of GPT-4 and o1 on the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found zero-shot performance to be close to that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: Recent large language models exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset curation and NLP modeling needs at the cost of increased computing, which could broaden accessibility for the medical community. Addressing the limitations we identify could further boost reliability. The code, data, and prompts for all our experiments are publicly available: this https URL
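The two structured-output settings can be sketched roughly as follows with the OpenAI Python client; the schema, prompts, and model name here are illustrative assumptions, not the paper's exact prompts.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client

# Simplified relation schema for the explicit-schema setting (an assumption,
# not the paper's exact schema).
RELATION_SCHEMA = {
    "type": "object",
    "properties": {
        "relations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "head": {"type": "string"},
                    "tail": {"type": "string"},
                    "relation": {"type": "string"},
                },
                "required": ["head", "tail", "relation"],
            },
        }
    },
    "required": ["relations"],
}

def zero_shot_re(text: str, explicit_schema: bool = True, model: str = "gpt-4-turbo"):
    """Run end-to-end zero-shot RE with either an explicit schema in the prompt
    or structure inferred from the prompt wording. Needs OPENAI_API_KEY set."""
    if explicit_schema:
        instruction = ("Extract biomedical relations from the text and return JSON "
                       f"matching this schema:\n{json.dumps(RELATION_SCHEMA)}")
    else:
        instruction = ("Extract biomedical relations from the text. Return a JSON object "
                       "with a 'relations' list of head / tail / relation entries.")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # GPT JSON mode
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": text}],
    )
    return json.loads(response.choices[0].message.content)
```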
https://arxiv.org/abs/2504.04083
In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation hinders large language models from correctly understanding relationships between biological entities, such as molecule-disease associations or drug interactions, and can result in misinterpretation of biomedical documents. To address this issue, current approaches generally adopt synthetic data augmentation, which involves similarity computation followed by word replacement, but this usually generates counterfactual data. As a result, these methods disrupt meaningful word sets or produce sentences whose meanings deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical-dedicated, rationale-based synthetic data augmentation method. Beyond naive lexical similarity, a specific bio-relation similarity is measured to ensure that each augmented instance retains a strong correlation with its bio-relation, instead of simply increasing the diversity of the augmented data. Moreover, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities and avoid falling into the mis-replacement trap. We evaluate our method on the BLURB and BigBIO benchmarks, which include 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.
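A rough sketch of the relation-aware filtering idea (not the paper's pipeline): a candidate word replacement must pass both a lexical-similarity check and a relation-level similarity check before the augmented sentence is kept. The `embed` and `rel_embed` encoders are assumed black boxes; the toy embedding below just stands in for them.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def relation_aware_augment(sentence, target_word, candidates, embed, rel_embed,
                           lex_threshold=0.7, rel_threshold=0.8):
    """Keep a candidate replacement only if it is lexically similar to the original
    word AND the augmented sentence stays close to the original in a relation-level
    embedding space, so the bio-relation is preserved."""
    kept = []
    for cand in candidates:
        if cosine(embed(target_word), embed(cand)) < lex_threshold:
            continue                                  # fails naive lexical similarity
        augmented = sentence.replace(target_word, cand)
        if cosine(rel_embed(sentence), rel_embed(augmented)) >= rel_threshold:
            kept.append(augmented)                    # passes bio-relation similarity
    return kept

# toy character-code embedding standing in for real word/relation encoders
toy_embed = lambda text: np.array([ord(c) for c in text[:12].ljust(12)], dtype=float)
print(relation_aware_augment("aspirin inhibits COX-1", "aspirin",
                             ["ibuprofen", "banana"], toy_embed, toy_embed))
```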
https://arxiv.org/abs/2503.23673
Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GRAPHTREX, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. Our method improves the state of the art by 5.5% in the tempeval $F_1$ score over the previous best, and by up to 8.9% on long-range relations, which pose a formidable challenge. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.
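The global-landmark idea can be sketched as a simple graph-construction step: a few landmark nodes are wired to every entity so that distant entities are at most two hops apart. This is only an illustration of the connectivity pattern, not the GRAPHTREX implementation.

```python
def build_landmark_graph(entities, local_edges, num_landmarks=2):
    """Besides local entity-entity edges, connect a few global 'landmark' nodes to
    every entity so that any two distant entities are at most two hops apart and
    information can propagate document-wide."""
    nodes = [("entity", e) for e in entities] + [("landmark", i) for i in range(num_landmarks)]
    edges = [(("entity", h), "local", ("entity", t)) for h, t in local_edges]
    for i in range(num_landmarks):
        for e in entities:
            edges.append((("landmark", i), "global", ("entity", e)))
            edges.append((("entity", e), "global", ("landmark", i)))
    return nodes, edges

nodes, edges = build_landmark_graph(["admission", "fever", "discharge"], [("admission", "fever")])
print(len(nodes), len(edges))  # 5 nodes, 13 edges (1 local + 12 landmark links)
```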
https://arxiv.org/abs/2503.18085
Relation extraction (RE) is a standard information extraction task playing a major role in downstream applications such as knowledge discovery and question answering. Although decoder-only large language models excel in generative tasks, smaller encoder models are still the go-to architecture for RE. In this paper, we revisit fine-tuning such smaller models using a novel dual-encoder architecture with a joint contrastive and cross-entropy loss. Unlike previous methods that employ a fixed linear layer for predicate representations, our approach uses a second encoder to compute instance-specific predicate representations by infusing them with real entity spans from the corresponding input instances. We conducted experiments on two biomedical RE datasets and two general-domain datasets. Our approach achieved F1 score improvements ranging from 1% to 2% over state-of-the-art methods with a simple but elegant formulation. Ablation studies justify the importance of the various components built into the proposed architecture.
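A hedged PyTorch sketch of the dual-encoder formulation with a joint cross-entropy and contrastive objective; the encoders here are toy MLPs and the loss weighting and temperature are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderRE(nn.Module):
    """One encoder embeds the input instance, a second encoder embeds each
    predicate representation (in the paper, infused with real entity spans);
    relation logits are similarities between the two embedding sets."""
    def __init__(self, dim=128):
        super().__init__()
        self.instance_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.predicate_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, instance_feats, predicate_feats):
        z_i = F.normalize(self.instance_enc(instance_feats), dim=-1)    # (B, d)
        z_p = F.normalize(self.predicate_enc(predicate_feats), dim=-1)  # (P, d)
        return z_i @ z_p.T                                              # (B, P) logits

def joint_loss(logits, labels, temperature=0.07, alpha=0.5):
    ce = F.cross_entropy(logits, labels)                         # classification term
    contrastive = F.cross_entropy(logits / temperature, labels)  # InfoNCE-style term
    return alpha * ce + (1 - alpha) * contrastive

model = DualEncoderRE()
logits = model(torch.randn(4, 128), torch.randn(10, 128))        # 4 instances, 10 predicates
print(joint_loss(logits, torch.tensor([0, 3, 2, 7])))
```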
https://arxiv.org/abs/2503.17799
Assertion status detection is a critical yet often overlooked component of clinical NLP, essential for accurately attributing extracted medical facts. Past studies have narrowly focused on negation detection, leading to underperforming commercial solutions such as AWS Medical Comprehend, Azure AI Text Analytics, and GPT-4o due to their limited domain adaptation. To address this gap, we developed state-of-the-art assertion detection models, including fine-tuned LLMs, transformer-based classifiers, few-shot classifiers, and deep learning (DL) approaches. We evaluated these models against cloud-based commercial API solutions, the legacy rule-based NegEx approach, and GPT-4o. Our fine-tuned LLM achieves the highest overall accuracy (0.962), outperforming GPT-4o (0.901) and commercial APIs by a notable margin, particularly excelling in Present (+4.2%), Absent (+8.4%), and Hypothetical (+23.4%) assertions. Our DL-based models surpass commercial solutions in Conditional (+5.3%) and Associated-with-Someone-Else (+10.1%) categories, while the few-shot classifier offers a lightweight yet highly competitive alternative (0.929), making it ideal for resource-constrained environments. Integrated within Spark NLP, our models consistently outperform black-box commercial solutions while enabling scalable inference and seamless integration with medical NER, Relation Extraction, and Terminology Resolution. These results reinforce the importance of domain-adapted, transparent, and customizable clinical NLP solutions over general-purpose LLMs and proprietary APIs.
https://arxiv.org/abs/2503.17425
Document-level relation extraction (DocRE) is the process of identifying and extracting relations between entities that span multiple sentences within a document. Due to its realistic settings, DocRE has garnered increasing research attention in recent years. Previous research has mostly focused on developing sophisticated encoding models to better capture the intricate patterns between entity pairs. While these advancements are undoubtedly crucial, an even more foundational challenge lies in the data itself. The complexity inherent in DocRE makes the labeling process prone to errors, compounded by the extreme sparsity of positive relation samples, which is driven by both the limited availability of positive instances and the broad diversity of positive relation types. These factors can lead to biased optimization processes, further complicating the task of accurate relation extraction. Recognizing these challenges, we have developed a robust framework called COMM to better solve DocRE. COMM operates by initially employing an instance-aware reasoning method to dynamically capture pertinent information of entity pairs within the document and extract relational features. Following this, COMM takes into account the distribution of relations and the difficulty of samples to dynamically adjust the margins between prediction logits and the decision threshold, a process we call Concentrated Margin Maximization. In this way, COMM not only enhances the extraction of relevant relational features but also boosts DocRE performance by addressing the specific challenges posed by the data. Extensive experiments and analysis demonstrate the versatility and effectiveness of COMM, especially its robustness when trained on low-quality data (achieves >10% performance gains).
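The margin-adjustment idea can be sketched as a loss that scores each relation logit against a learned threshold logit, with margins scaled by relation rarity and sample difficulty. This is an illustrative reading of Concentrated Margin Maximization, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def concentrated_margin_loss(logits, labels, relation_freq, base_margin=1.0):
    """Treat the last logit column as a learned decision threshold; scale each
    relation's margin by its rarity and by sample difficulty (distance to the
    threshold), then push positives above and negatives below the threshold."""
    threshold = logits[:, -1:]                              # (B, 1) threshold logit
    rel_logits = logits[:, :-1]                             # (B, R) relation logits
    rarity = 1.0 / (relation_freq + 1e-6)
    rarity = rarity / rarity.mean()                         # rarer relations -> larger margin
    difficulty = torch.sigmoid(-(rel_logits - threshold))   # near-threshold samples weigh more
    margin = base_margin * rarity * difficulty              # (B, R) dynamic margins
    pos = F.relu(threshold + margin - rel_logits) * labels  # positives must clear the margin
    neg = F.relu(rel_logits - threshold) * (1 - labels)     # negatives must stay below
    return (pos + neg).mean()

logits = torch.randn(4, 6)                 # 5 relations + 1 threshold column
labels = torch.randint(0, 2, (4, 5)).float()
freq = torch.tensor([100.0, 40.0, 10.0, 5.0, 1.0])
print(concentrated_margin_loss(logits, labels, freq))
```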
https://arxiv.org/abs/2503.13885
Clinical oncology generates vast, unstructured data that often contain inconsistencies, missing information, and ambiguities, making it difficult to extract reliable insights for data-driven decision-making. General-purpose large language models (LLMs) struggle with these challenges due to their lack of domain-specific reasoning, including specialized clinical terminology, context-dependent interpretations, and multi-modal data integration. We address these issues with an oncology-specialized, efficient, and adaptable NLP framework that combines instruction tuning, retrieval-augmented generation (RAG), and graph-based knowledge integration. Our lightweight models prove effective at oncology-specific tasks, such as named entity recognition (e.g., identifying cancer diagnoses), entity linking (e.g., linking entities to standardized ontologies), TNM staging, document classification (e.g., cancer subtype classification from pathology reports), and treatment response prediction. Our framework emphasizes adaptability and resource efficiency. We include minimal German instructions, collected at the University Hospital Zurich (USZ), to test whether small amounts of non-English language data can effectively transfer knowledge across languages. This approach mirrors our motivation for lightweight models, which balance strong performance with reduced computational costs, making them suitable for resource-limited healthcare settings. We validated our models on oncology datasets, demonstrating strong results in named entity recognition, relation extraction, and document classification.
https://arxiv.org/abs/2503.08323
We conduct an empirical analysis of neural network architectures and data transfer strategies for causal relation extraction. By conducting experiments with various contextual embedding layers and architectural components, we show that a relatively straightforward BioBERT-BiGRU relation extraction model generalizes better than other architectures across varying web-based sources and annotation strategies. Furthermore, we introduce a metric for evaluating transfer performance, $F1_{phrase}$, which emphasizes noun phrase localization rather than directly matching target tags. Using this metric, we conduct data transfer experiments, ultimately revealing that augmentation with data from varying domains and annotation styles can improve performance. Data augmentation is especially beneficial when an adequate proportion of implicitly and explicitly causal sentences is included.
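A minimal sketch of a phrase-level F1 in the spirit of $F1_{phrase}$, counting a prediction as correct when it overlaps a gold noun-phrase span; the paper's actual matching rule may differ.

```python
def f1_phrase(pred_spans, gold_spans):
    """Phrase-localization F1: a predicted (start, end) span counts as a hit if it
    overlaps a gold noun-phrase span, instead of requiring exact tag agreement."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    if not pred_spans or not gold_spans:
        return 0.0
    precision = sum(any(overlaps(p, g) for g in gold_spans) for p in pred_spans) / len(pred_spans)
    recall = sum(any(overlaps(g, p) for p in pred_spans) for g in gold_spans) / len(gold_spans)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(f1_phrase(pred_spans=[(0, 3), (10, 12)], gold_spans=[(1, 4), (20, 22)]))  # 0.5
```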
https://arxiv.org/abs/2503.06076
Previous work on clinical relation extraction from free-text sentences leveraged information about semantic types from clinical knowledge bases as part of the entity representations. In this paper, we exploit additional evidence by also making use of domain-specific semantic type dependencies. We encode the relation between a span of tokens matching a Unified Medical Language System (UMLS) concept and the other tokens in the sentence. We implement our method and compare it against different named entity recognition (NER) architectures (i.e., BiLSTM-CRF and BiLSTM-GCN-CRF) using different pre-trained clinical embeddings (i.e., BERT, BioBERT, UMLSBert). Our experimental results on clinical datasets show that in some cases NER effectiveness can be significantly improved by making use of domain-specific semantic type dependencies. Our work is also the first to generate a matrix encoding that makes use of more than three dependencies in one pass for the NER task.
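The matrix encoding can be pictured as a token-by-token grid in which each cell stores the dependency type linking a UMLS-matched span to another token; the sketch below uses placeholder type ids and is only an illustration of the encoding shape, not the paper's construction.

```python
import numpy as np

def dependency_matrix(num_tokens, umls_spans, dependency_types):
    """Cell (i, j) stores the id of the semantic-type dependency relating token i
    (inside a UMLS-matched span) to token j outside that span; 0 means no
    dependency. Several dependencies can thus be encoded in a single pass."""
    matrix = np.zeros((num_tokens, num_tokens), dtype=int)
    for (start, end), dep_type in zip(umls_spans, dependency_types):
        for i in range(start, end):
            for j in range(num_tokens):
                if not (start <= j < end):
                    matrix[i, j] = dep_type
    return matrix

# tokens 2..3 match a UMLS concept; placeholder dependency type 4 links them to other tokens
print(dependency_matrix(6, umls_spans=[(2, 4)], dependency_types=[4]))
```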
https://arxiv.org/abs/2503.05373
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns such as over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document over a biased document without the answer in less than 3% of cases. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.
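The controlled-experiment setup can be sketched as a simple bias probe: each query is paired with an answer-containing document and a biased distractor, and we count how often the retriever's scoring function prefers the gold document. The scorer below is a deliberately length-biased toy, not a real retriever.

```python
def bias_win_rate(probe_cases, score):
    """Each probe case pairs an answer-containing document with a biased distractor
    that lacks the answer; we count how often the retriever's scoring function
    ranks the gold document higher."""
    wins = sum(score(query, gold) > score(query, distractor)
               for query, gold, distractor in probe_cases)
    return wins / len(probe_cases)

# toy scorer that rewards short documents, mimicking a length bias
length_biased = lambda query, doc: -len(doc)
probe = [("Who founded Acme?",
          "A long quarterly report ... Jo founded Acme in 1990.",
          "Acme. Acme. Acme.")]
print(bias_win_rate(probe, length_biased))  # 0.0 -> the biased distractor always wins
```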
https://arxiv.org/abs/2503.05037
Foundation models, including language models, e.g., GPT, and vision models, e.g., CLIP, have significantly advanced numerous biomedical tasks. Despite these advancements, high inference latency and the "overthinking" issue during inference impair the efficiency and effectiveness of foundation models, limiting their application in real-time clinical settings. To address these challenges, we proposed EPEE (Entropy- and Patience-based Early Exiting), a novel hybrid strategy designed to improve the inference efficiency of foundation models. The core idea was to leverage the strengths of entropy-based and patience-based early exiting methods to overcome their respective weaknesses. To evaluate EPEE, we conducted experiments on three core biomedical tasks (classification, relation extraction, and event extraction) using four foundation models (BERT, ALBERT, GPT-2, and ViT) across twelve datasets, including clinical notes and medical images. The results showed that EPEE significantly reduced inference time while maintaining or improving accuracy, demonstrating its adaptability to diverse datasets and tasks. EPEE addresses critical barriers to deploying foundation models in healthcare by balancing efficiency and effectiveness, and it potentially provides a practical solution for real-time clinical decision-making with foundation models, supporting reliable and efficient workflows.
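The hybrid exit rule can be sketched as follows, assuming per-layer classifier logits are available: exit when prediction entropy drops below a threshold or when the same class has been predicted for a fixed number of consecutive layers. Thresholds here are illustrative, not EPEE's tuned values.

```python
import torch
import torch.nn.functional as F

def epee_exit(layer_logits, entropy_threshold=0.3, patience=2):
    """Hybrid early exit: leave at the first intermediate classifier whose
    prediction entropy is below a threshold OR whose class prediction has been
    stable for `patience` consecutive layers; otherwise use the last layer."""
    streak, last_pred = 0, None
    for layer, logits in enumerate(layer_logits):
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.log()).sum().item()
        pred = int(probs.argmax())
        streak = streak + 1 if pred == last_pred else 1
        last_pred = pred
        if entropy < entropy_threshold or streak >= patience:
            return layer, pred                      # exit early here
    return len(layer_logits) - 1, last_pred

layers = [torch.tensor([0.2, 0.1, 0.0]),            # uncertain early layer
          torch.tensor([2.0, 0.1, 0.0]),
          torch.tensor([5.0, 0.1, 0.0])]
print(epee_exit(layers))  # exits at layer 1 via the patience criterion
```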
https://arxiv.org/abs/2503.02053
Extracting causal relationships from a medical case report is essential for comprehending the case, particularly its diagnostic process. Since the diagnostic process is regarded as a bottom-up inference, causal relationships in cases naturally form a multi-layered tree structure. The existing tasks, such as medical relation extraction, are insufficient for capturing the causal relationships of an entire case, as they treat all relations equally without considering the hierarchical structure inherent in the diagnostic process. Thus, we propose a novel task, Causal Tree Extraction (CTE), which receives a case report and generates a causal tree with the primary disease as the root, providing an intuitive understanding of a case's diagnostic process. Subsequently, we construct a Japanese case report CTE dataset, J-Casemap, propose a generation-based CTE method that outperforms the baseline by 20.2 points in the human evaluation, and introduce evaluation metrics that reflect clinician preferences. Further experiments also show that J-Casemap enhances the performance of solving other medical tasks, such as question answering.
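The target output can be pictured as a small tree data structure rooted at the primary disease; the field names below are illustrative and do not reflect the J-Casemap schema.

```python
from dataclasses import dataclass, field

@dataclass
class CausalNode:
    """A node in the causal tree: the primary disease sits at the root and each
    child is a finding or event causally linked to its parent, mirroring the
    bottom-up diagnostic process."""
    finding: str
    causes: list = field(default_factory=list)

    def add_cause(self, finding):
        child = CausalNode(finding)
        self.causes.append(child)
        return child

    def render(self, depth=0):
        lines = ["  " * depth + self.finding]
        for child in self.causes:
            lines.extend(child.render(depth + 1))
        return lines

root = CausalNode("bacterial pneumonia")        # primary disease as the root
fever = root.add_cause("fever")
root.add_cause("productive cough")
fever.add_cause("elevated CRP")
print("\n".join(root.render()))
```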
https://arxiv.org/abs/2503.01302
The advancement of Large Language Models (LLMs) has significantly impacted biomedical Natural Language Processing (NLP), enhancing tasks such as named entity recognition, relation extraction, event extraction, and text classification. In this context, the DeepSeek series of models have shown promising potential in general NLP tasks, yet their capabilities in the biomedical domain remain underexplored. This study evaluates multiple DeepSeek models (Distilled-DeepSeek-R1 series and Deepseek-LLMs) across four key biomedical NLP tasks using 12 datasets, benchmarking them against state-of-the-art alternatives (Llama3-8B, Qwen2.5-7B, Mistral-7B, Phi-4-14B, Gemma-2-9B). Our results reveal that while DeepSeek models perform competitively in named entity recognition and text classification, challenges persist in event and relation extraction due to precision-recall trade-offs. We provide task-specific model recommendations and highlight future research directions. This evaluation underscores the strengths and limitations of DeepSeek models in biomedical NLP, guiding their future deployment and optimization.
https://arxiv.org/abs/2503.00624
Few-shot Continual Relation Extraction is a crucial challenge for enabling AI systems to identify and adapt to evolving relationships in dynamic real-world domains. Traditional memory-based approaches often overfit to limited samples, failing to reinforce old knowledge, with the scarcity of data in few-shot scenarios further exacerbating these issues by hindering effective data augmentation in the latent space. In this paper, we propose a novel retrieval-based solution, starting with a large language model to generate descriptions for each relation. From these descriptions, we introduce a bi-encoder retrieval training paradigm to enrich both sample and class representation learning. Leveraging these enhanced representations, we design a retrieval-based prediction method where each sample "retrieves" the best fitting relation via a reciprocal rank fusion score that integrates both relation description vectors and class prototypes. Extensive experiments on multiple datasets demonstrate that our method significantly advances the state-of-the-art by maintaining robust performance across sequential tasks, effectively addressing catastrophic forgetting.
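A hedged sketch of the retrieval-based prediction step: the sample is ranked separately against relation-description vectors and class prototypes, and the two rankings are combined with a reciprocal rank fusion score; the fusion constant is the common RRF default, not necessarily the paper's choice.

```python
import numpy as np

def rrf_predict(sample_emb, description_embs, prototype_embs, k=60):
    """Rank the sample against relation-description vectors and against class
    prototypes separately, then fuse the two rank lists with the reciprocal rank
    fusion score 1 / (k + rank) and pick the best-fitting relation."""
    def ranks(embs):
        sims = embs @ sample_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(sample_emb) + 1e-9)
        order = np.argsort(-sims)                   # best first
        r = np.empty_like(order)
        r[order] = np.arange(1, len(order) + 1)     # rank of each relation (1 = best)
        return r
    fused = 1.0 / (k + ranks(description_embs)) + 1.0 / (k + ranks(prototype_embs))
    return int(np.argmax(fused))

rng = np.random.default_rng(0)
print(rrf_predict(rng.standard_normal(32),
                  rng.standard_normal((5, 32)),    # 5 relation-description vectors
                  rng.standard_normal((5, 32))))   # 5 class prototypes
```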
https://arxiv.org/abs/2502.20596
Typically, Few-shot Continual Relation Extraction (FCRE) models must balance retaining prior knowledge while adapting to new tasks with extremely limited data. However, real-world scenarios may also involve unseen or undetermined relations that existing methods still struggle to handle. To address these challenges, we propose a novel approach that leverages the Open Information Extraction concept of Knowledge Graph Construction (KGC). Our method not only exposes models to all possible pairs of relations, including determined and undetermined labels not available in the training set, but also enriches model knowledge with diverse relation descriptions, thereby enhancing knowledge retention and adaptability in this challenging scenario. From the perspective of KGC, this is the first work to explore the Continual Learning setting, allowing efficient expansion of the graph as the data evolves. Experimental results demonstrate our superior performance compared to other state-of-the-art FCRE baselines, as well as our efficiency in handling dynamic graph construction in this setting.
https://arxiv.org/abs/2502.16648
Objective: To optimize in-context learning in biomedical natural language processing by improving example selection. Methods: We introduce a novel multi-mode retrieval-augmented generation (MMRAG) framework, which integrates four retrieval strategies: (1) Random Mode, selecting examples arbitrarily; (2) Top Mode, retrieving the most relevant examples based on similarity; (3) Diversity Mode, ensuring variation in selected examples; and (4) Class Mode, selecting category-representative examples. This study evaluates MMRAG on three core biomedical NLP tasks: Named Entity Recognition (NER), Relation Extraction (RE), and Text Classification (TC). The datasets used include BC2GM for gene and protein mention recognition (NER), DDI for drug-drug interaction extraction (RE), GIT for general biomedical information extraction (RE), and HealthAdvice for health-related text classification (TC). The framework is tested with two large language models (Llama2-7B, Llama3-8B) and three retrievers (Contriever, MedCPT, BGE-Large) to assess performance across different retrieval strategies. Results: The results from the Random mode indicate that providing more examples in the prompt improves the model's generation performance. Meanwhile, Top mode and Diversity mode significantly outperform Random mode on the RE (DDI) task, achieving an F1 score of 0.9669, a 26.4% improvement. Among the three retrievers tested, Contriever outperformed the other two in a greater number of experiments. Additionally, Llama 2 and Llama 3 demonstrated varying capabilities across different tasks, with Llama 3 showing a clear advantage in handling NER tasks. Conclusion: MMRAG effectively enhances biomedical in-context learning by refining example selection, mitigating data scarcity issues, and demonstrating superior adaptability for NLP-driven healthcare applications.
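The four retrieval strategies can be sketched as simple selection rules over example embeddings; the embeddings are assumed to come from any retriever (e.g., Contriever), and the exact selection details in the paper may differ.

```python
import numpy as np

def select_examples(query_emb, pool_embs, labels, k=4, mode="top", seed=0):
    """Pick in-context examples with one of four strategies: random, top
    (most similar), diversity (similar yet mutually distant), or class
    (one representative per category)."""
    rng = np.random.default_rng(seed)
    sims = pool_embs @ query_emb
    if mode == "random":
        return list(rng.choice(len(pool_embs), size=k, replace=False))
    if mode == "top":
        return list(np.argsort(-sims)[:k])
    if mode == "diversity":
        chosen = [int(np.argmax(sims))]
        while len(chosen) < k:
            dists = np.min([np.linalg.norm(pool_embs - pool_embs[c], axis=1) for c in chosen], axis=0)
            dists[chosen] = -1                       # never re-pick a chosen example
            chosen.append(int(np.argmax(dists)))
        return chosen
    if mode == "class":
        picks = []
        for lab in dict.fromkeys(labels):            # unique labels, original order
            members = [i for i, l in enumerate(labels) if l == lab]
            picks.append(members[int(np.argmax(sims[members]))])
        return picks[:k]
    raise ValueError(mode)

pool = np.random.default_rng(1).standard_normal((8, 16))
print(select_examples(pool[0], pool, labels=list("aabbccdd"), mode="diversity"))
```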
https://arxiv.org/abs/2502.15954
In this article, we present the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts. Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family - such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT - with the power of Transformer encoders to capture long-term dependencies between tokens. Experiments conducted on the dataset from the TextMine'25 challenge show that our model achieves superior performance, particularly when using CamemBERT-Large, with a macro F1 score of 0.654, surpassing the results obtained with FlauBERT-Large. These results demonstrate the effectiveness of our approach for the automatic extraction of complex relations in intelligence reports.
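A hedged sketch of the described architecture: a BERT-family backbone (CamemBERT here) followed by an additional Transformer encoder and a sigmoid multi-label head; layer sizes, pooling, and the number of labels are assumptions, not the BTransformer18 configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelRelationClassifier(nn.Module):
    """BERT-family backbone followed by an extra Transformer encoder over the
    token states, mean-pooled and fed to a sigmoid multi-label head."""
    def __init__(self, model_name="camembert-base", num_relations=18):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        dim = self.backbone.config.hidden_size
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_relations)    # num_relations is a placeholder

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        hidden = self.encoder(hidden, src_key_padding_mask=attention_mask == 0)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)
        return self.head(pooled)                     # one logit per relation label

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
batch = tokenizer(["L'entité A emploie l'entité B."], return_tensors="pt", padding=True)
logits = MultiLabelRelationClassifier()(batch["input_ids"], batch["attention_mask"])
print(torch.sigmoid(logits).shape)  # multi-label probabilities, one per relation type
```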
https://arxiv.org/abs/2502.15619
Universal Information Extraction (UIE) has garnered significant attention due to its ability to address model explosion problems effectively. Extractive UIE can achieve strong performance using a relatively small model, making it widely adopted. Extractive UIE models generally rely on task instructions for different tasks, including single-target instructions and multiple-target instructions. Single-target instruction UIE enables the extraction of only one type of relation at a time, limiting its ability to model correlations between relations and thus restricting its capability to extract complex relations. While multiple-target instruction UIE allows for the extraction of multiple relations simultaneously, the inclusion of irrelevant relations introduces decision complexity and impacts extraction accuracy. Therefore, for multi-relation extraction, we propose LDNet, which incorporates multi-aspect relation modeling and a label drop mechanism. By assigning different relations to different levels for understanding and decision-making, we reduce decision confusion. Additionally, the label drop mechanism effectively mitigates the impact of irrelevant relations. Experiments show that LDNet outperforms or achieves competitive performance with state-of-the-art systems on 9 tasks and 33 datasets, in both single-modal and multi-modal, few-shot and zero-shot settings (this https URL).
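The label drop mechanism can be sketched as masking out relation logits whose estimated relevance to the input falls below a threshold before the final decision; the relevance estimator and threshold below are illustrative placeholders, not LDNet's actual components.

```python
import torch

def label_drop(relation_logits, relevance_scores, keep_threshold=0.5):
    """Mask out relation labels whose estimated relevance to the input falls below
    a threshold, so irrelevant relations listed in a multiple-target instruction
    cannot interfere with the final decision."""
    keep = relevance_scores >= keep_threshold                    # (R,) boolean mask
    masked = relation_logits.masked_fill(~keep, float("-inf"))   # dropped labels can't win
    return masked, keep

logits = torch.tensor([2.1, 0.3, 1.7, -0.5])
relevance = torch.tensor([0.9, 0.2, 0.8, 0.1])  # e.g., from a lightweight relevance head
masked, keep = label_drop(logits, relevance)
print(masked, keep)
```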
https://arxiv.org/abs/2502.12614