Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.
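The iterative selection idea lends itself to a short sketch. The following is an illustrative, simplified take on "actively selecting the most informative examples" for long-tail relations, not the paper's actual algorithm; the entropy criterion, the `rare_threshold` parameter, and all names are assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(unlabeled, label_counts, budget, rare_threshold=5):
    """Pick the `budget` most uncertain examples whose top predicted
    relation is rare in the current training set.
    `unlabeled` is a list of (example_id, {relation: prob}) pairs."""
    candidates = []
    for ex_id, dist in unlabeled:
        top_rel = max(dist, key=dist.get)
        if label_counts.get(top_rel, 0) <= rare_threshold:  # long-tail relation
            candidates.append((entropy(dist.values()), ex_id))
    candidates.sort(reverse=True)  # most uncertain first
    return [ex_id for _, ex_id in candidates[:budget]]
```

Selected examples would then be manually annotated and added to the training set before the next iteration.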
https://arxiv.org/abs/2601.11190
The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, English evaluations outperformed their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
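Demonstration retrieval of the kind RAR performs can be approximated, at its simplest, by nearest-neighbor search over embeddings. The sketch below shows only generic top-k cosine retrieval; the contrastive training that gives RAR its sentence- and relation-level semantics is not modeled, and all names are hypothetical:

```python
import numpy as np

def retrieve_demonstrations(query_vec, pool_vecs, pool_examples, k=3):
    """Return the k pool examples whose embeddings are most similar
    to the query embedding (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity per pool item
    top = np.argsort(-sims)[:k]      # indices of the k most similar items
    return [pool_examples[i] for i in top]
```

The retrieved examples would be formatted as ICL demonstrations and prepended to the test sentence in the prompt.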
https://arxiv.org/abs/2601.09367
Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages like Malaysian English. These challenges include the incorporation of local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address limitations in the existing DocZSRE approach. The DocZSRE-SI framework leverages Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in the macro F1-Score compared to baseline models and existing benchmarks. By utilizing Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone, LLM-based methods, demonstrating significant advancements in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts like Malaysian English news articles, where traditional LLM-based approaches fall short.
https://arxiv.org/abs/2601.07271
Digital twins -- virtual replicas of physical entities -- are gaining traction in healthcare for personalized monitoring, predictive modeling, and clinical decision support. However, generating interoperable patient digital twins from unstructured electronic health records (EHRs) remains challenging due to variability in clinical documentation and lack of standardized mappings. This paper presents a semantic NLP-driven pipeline that transforms free-text EHR notes into FHIR-compliant digital twin representations. The pipeline leverages named entity recognition (NER) to extract clinical concepts, concept normalization to map entities to SNOMED-CT or ICD-10, and relation extraction to capture structured associations between conditions, medications, and observations. Evaluation on MIMIC-IV Clinical Database Demo with validation against MIMIC-IV-on-FHIR reference mappings demonstrates high F1-scores for entity and relation extraction, with improved schema completeness and interoperability compared to baseline methods.
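The final step of such a pipeline, wrapping a normalized concept as a FHIR resource, might look like this minimal sketch (the helper name and the choice of fields are illustrative; a production mapping would populate many more FHIR elements):

```python
def to_fhir_condition(patient_id, snomed_code, display):
    """Wrap a normalized clinical concept as a minimal FHIR Condition
    resource linked to its patient."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": snomed_code,
                "display": display,
            }]
        },
    }
```

Analogous wrappers for MedicationStatement and Observation resources, plus references between them derived from the extracted relations, would complete the digital twin representation.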
https://arxiv.org/abs/2601.05847
Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.
https://arxiv.org/abs/2601.01844
In fact-checking applications, a common reason to reject a claim is the presence of erroneous cause-effect relationships between the events at play. However, current automated fact-checking methods lack dedicated causal reasoning, potentially missing a valuable opportunity for semantically rich explainability. To address this gap, we propose a methodology that combines event relation extraction, semantic similarity computation, and rule-based reasoning to detect logical inconsistencies between chains of events mentioned in a claim and in a piece of evidence. Evaluated on two fact-checking datasets, this method establishes the first baseline for integrating fine-grained causal event relationships into fact-checking and enhances the explainability of verdict prediction.
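One simple rule of the kind described, flagging a claim whose causal direction is reversed in the evidence, could be sketched as follows (the edge representation and the pluggable `same` predicate standing in for semantic similarity are assumptions):

```python
def inconsistent(claim_edges, evidence_edges, same=lambda a, b: a == b):
    """Flag a logical inconsistency between causal chains.
    Edges are (cause, effect) tuples; `same` stands in for a
    semantic-similarity test between event mentions."""
    for c_cause, c_effect in claim_edges:
        for e_cause, e_effect in evidence_edges:
            # claim says A causes B, but evidence says B causes A
            if same(c_cause, e_effect) and same(c_effect, e_cause):
                return True
    return False
```

In practice `same` would threshold an embedding similarity rather than test string equality, and further rules would cover other contradiction patterns.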
https://arxiv.org/abs/2512.13286
Legal relations form a highly consequential analytical framework of the civil law system, serving as a crucial foundation for resolving disputes and realizing the values of the rule of law in judicial practice. However, legal relations in Chinese civil cases remain underexplored in the field of legal artificial intelligence (legal AI), largely due to the absence of comprehensive schemas. In this work, we first introduce a comprehensive schema, which contains a hierarchical taxonomy and definitions of arguments, for AI systems to capture legal relations in Chinese civil cases. Based on this schema, we then formulate the legal relation extraction task and present LexRel, an expert-annotated benchmark for legal relation extraction in Chinese civil law. We use LexRel to evaluate state-of-the-art large language models (LLMs) on legal relation extraction, showing that current LLMs exhibit significant limitations in accurately identifying civil legal relations. Furthermore, we demonstrate that incorporating legal relation information leads to consistent performance gains on other downstream legal AI tasks.
https://arxiv.org/abs/2512.12643
Although Large Language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4's capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.
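The feedback-driven loop at the heart of such a framework can be sketched generically. This is not SCIR's Dual-Path module, just the extract/detect/correct skeleton with all callables supplied by the caller:

```python
def self_correcting_extraction(text, extract, detect, correct, max_rounds=3):
    """Feedback-driven refinement loop: run the extractor, let a
    detector flag bad results, and ask the corrector to revise them
    until the detector is satisfied or the round budget is spent."""
    result = extract(text)
    for _ in range(max_rounds):
        errors = detect(text, result)
        if not errors:          # detector accepts the current result
            break
        result = correct(text, result, errors)
    return result
```

In the paper's setting, `detect` would be an IE result detection model distilled from GPT-4, while `extract` and `correct` would be calls to the underlying LLM-based IE system.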
https://arxiv.org/abs/2512.12337
We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks, sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP), and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning the pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM, and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at this https URL.
https://arxiv.org/abs/2512.23714
In this research, we combine Transformer-based relation extraction with knowledge graph (KG) matching and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Due to their high construction cost, they have long been regarded as static databases with validated links. However, the recent development of Transformer-based relation extraction (RE) methods has enabled us to generate KGs dynamically from natural language texts, opening the possibility of representing the meaning of input sentences with the created KGs. Building on this, we propose a method that answers MCQs in the "fill-in-the-blank" format, accounting for the fact that RE methods generate KGs representing false information when provided with factually incorrect texts. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying it against factually correct KGs under the closed-world assumption. The experimental results demonstrate that our method correctly answers around 70% of the questions while providing traceability of the procedure. We also highlight that the question category has a substantial influence on accuracy.
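The verification step can be sketched as scoring each candidate answer by how many of its extracted triples appear in a trusted KG. This is an illustrative reduction of the paper's procedure; the scoring rule and names are assumptions:

```python
def answer_mcq(options, extract_triples, kg):
    """Score each filled-in candidate sentence by the fraction of its
    extracted triples present in a trusted KG; under the closed-world
    assumption, absent triples count as false."""
    def score(option):
        triples = extract_triples(option)
        if not triples:
            return 0.0
        return sum(t in kg for t in triples) / len(triples)
    return max(options, key=score)
```

Because the chosen option is justified by the specific triples that matched, the decision remains traceable.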
https://arxiv.org/abs/2511.14144
We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across the discourse-lexicon, (multi-)sentential, and documental levels, with 52 individual datasets in total. It covers both extensively studied tasks, such as discourse parsing and temporal relation extraction, and novel challenges, such as discourse particle disambiguation (e.g., "just"), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance on the arithmetic aspects of temporal reasoning, but struggle with full-document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.
https://arxiv.org/abs/2511.13095
Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
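The graph-prompt step, turning labeled directed edges into natural language, admits a very small sketch (the edge format and wording are invented for illustration):

```python
def graph_to_prompt(edges):
    """Serialize a directed relation graph over dialogue turns into a
    natural-language constraint prompt. Edges are
    (source_turn, relation, target_turn) triples."""
    if not edges:
        return "No cross-turn constraints."
    lines = ["Constraints from earlier turns:"]
    for src, rel, dst in edges:
        lines.append(f"- Turn {src} {rel} turn {dst}.")
    return "\n".join(lines)
```

The resulting prompt would be handed to the response-rewriting module together with the initial LLM output.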
https://arxiv.org/abs/2511.10051
Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications makes it possible to identify research concepts and resources at scale and is therefore a pathway to improving the understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g., method training and data usage, we introduce GSAP-ERE. It is a manually curated, fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE with Large Language Models (LLMs). We observe that state-of-the-art LLM prompting methods are largely outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This performance disparity between supervised models and the unsupervised use of LLMs suggests that datasets like GSAP-ERE are needed to advance research in scholarly information extraction.
https://arxiv.org/abs/2511.09411
Large Language Models (LLMs) have demonstrated remarkable capabilities in document understanding. However, recent research reveals that LLMs still exhibit performance gaps in Document-level Relation Extraction (DocRE), a task that requires fine-grained comprehension. The commonly adopted "extract entities then predict relations" paradigm in LLM-based methods leads to these gaps for two main reasons: (1) numerous unrelated entity pairs introduce noise and interfere with relation prediction for truly related entity pairs, and (2) even when LLMs have identified semantic associations between entities, relation labels beyond the predefined set are still treated as prediction errors. To address these challenges, we propose a novel Relation as a Prior (RelPrior) paradigm for LLM-based DocRE. For challenge (1), RelPrior uses a binary relation as a prior to determine whether two entities are correlated, thereby filtering out irrelevant entity pairs and reducing prediction noise. For challenge (2), RelPrior uses each predefined relation as a prior to match entities for triple extraction instead of directly predicting the relation, avoiding misjudgments caused by strict predefined relation labeling. Extensive experiments on two benchmarks demonstrate that RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.
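The two-stage idea can be sketched with placeholder classifiers. Here `is_related` stands in for the binary prior and `match_relation` for relation-conditioned entity matching; both, and the overall shape, are assumptions rather than RelPrior's actual implementation:

```python
def relprior_extract(pairs, is_related, match_relation, relations):
    """Two-stage extraction: (1) a binary prior filters out unrelated
    entity pairs; (2) each predefined relation is used as a prior to
    match the surviving pairs, instead of free-form relation prediction."""
    related = [p for p in pairs if is_related(p)]        # stage 1: denoise
    triples = []
    for rel in relations:                                # stage 2: relation as prior
        for head, tail in related:
            if match_relation(rel, head, tail):
                triples.append((head, rel, tail))
    return triples
```

Iterating over the predefined relations, rather than predicting a free-form label, is what keeps out-of-set labels from being produced in the first place.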
https://arxiv.org/abs/2511.08143
This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.
https://arxiv.org/abs/2511.03610
RDF pattern-based extraction is a compelling approach for fine-tuning small language models (SLMs) by focusing a relation extraction task on a specified SHACL shape. This technique enables the development of efficient models trained on limited text and RDF data. In this article, we introduce Kastor, a framework that advances this approach to meet the demands of completing and refining knowledge bases in specialized domains. Kastor reformulates the traditional validation task, shifting from single SHACL shape validation to evaluating all possible combinations of properties derived from the shape. By selecting the optimal combination for each training example, the framework significantly enhances model generalization and performance. Additionally, Kastor employs an iterative learning process to refine noisy knowledge bases, enabling the creation of robust models capable of uncovering new, relevant facts.
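Enumerating the property combinations derived from a shape is straightforward with itertools; selecting the optimal combination per training example (the framework's actual contribution) is not shown:

```python
from itertools import combinations

def property_combinations(shape_properties, min_size=1):
    """Enumerate every combination of properties derived from a SHACL
    shape, so each training example can be validated against the
    combination that fits it best rather than the full shape."""
    combos = []
    for r in range(min_size, len(shape_properties) + 1):
        combos.extend(combinations(sorted(shape_properties), r))
    return combos
```

Each training example would then be scored against every combination, and the best-fitting one used as its supervision target.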
https://arxiv.org/abs/2511.03466
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To address this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy for performing uniformly well over unbalanced target properties is to build a training set in which the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we have publicly released our datasets, experimental results, and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
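The winning strategy, a training set in which every property clears an occurrence threshold, can be sketched as a top-up step (the `synthesize` callback standing in for template-based generation is an assumption):

```python
import random
from collections import defaultdict

def balance_by_property(examples, threshold, synthesize, seed=0):
    """Ensure every target property occurs at least `threshold` times,
    topping up rare properties with template-based synthetic examples.
    `examples` is a list of (property, text) pairs."""
    rng = random.Random(seed)
    by_prop = defaultdict(list)
    for prop, text in examples:
        by_prop[prop].append((prop, text))
    out = list(examples)
    for prop, items in by_prop.items():
        for _ in range(threshold - len(items)):  # top up rare properties only
            out.append((prop, synthesize(prop, rng)))
    return out
```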
https://arxiv.org/abs/2511.03407
Relation extraction between drugs plays a crucial role in identifying drug-drug interactions and predicting side effects. Advances in machine learning methods for relation extraction, along with the development of large medical text databases, have enabled the low-cost extraction of such relations compared to approaches that typically require expert knowledge. However, to the best of our knowledge, few datasets specifically designed for drug-drug relation extraction are currently available, so transfer learning becomes necessary to apply machine learning methods in this domain. In this study, we propose DREAM, a method that first employs a trained relation extraction model to discover relations between entities and then applies this model to a corpus of medical texts to construct an ontology of drug relationships. The extracted relations are subsequently validated using a large language model (LLM). Quantitative results indicate that the LLM agreed with 71 of the relations extracted from a subset of PubMed abstracts. Furthermore, our qualitative analysis indicates that this approach can uncover ambiguities in the medical domain, highlighting the challenges inherent in relation extraction in this field.
https://arxiv.org/abs/2510.23189
Distantly Supervised Relation Extraction (DSRE) remains a long-standing challenge in NLP, where models must learn from noisy bag-level annotations while making sentence-level predictions. While existing state-of-the-art (SoTA) DSRE models rely on task-specific training, their integration with in-context learning (ICL) using large language models (LLMs) remains underexplored. A key challenge is that the LLM may not learn relation semantics correctly due to noisy annotation. In response, we propose HYDRE, a HYbrid Distantly supervised Relation Extraction framework. It first uses a trained DSRE model to identify the top-k candidate relations for a given test sentence, then uses a novel dynamic exemplar retrieval strategy that extracts reliable sentence-level exemplars from the training data, which are provided in the LLM prompt to produce the final relation(s). We further extend HYDRE to cross-lingual settings for RE in low-resource languages. Using available English DSRE training data, we evaluate all methods on English as well as a newly curated benchmark covering four diverse low-resource Indic languages: Oriya, Santali, Manipuri, and Tulu. HYDRE achieves up to 20 F1-point gains in English and, on average, 17 F1 points on the Indic languages over prior SoTA DSRE models. Detailed ablations demonstrate HYDRE's efficacy compared to other prompting strategies.
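The prompt-construction step can be sketched as follows; the formatting, parameter names, and exemplar bank are illustrative assumptions, not the paper's exact prompt:

```python
def build_hydre_prompt(sentence, candidate_scores, exemplar_bank, k=3, per_rel=2):
    """Compose an ICL prompt: take the top-k relations scored by a
    trained DSRE model and attach reliable sentence-level exemplars
    for each candidate before asking the LLM for the final relation."""
    top_k = sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:k]
    lines = []
    for rel in top_k:
        for ex_sent in exemplar_bank.get(rel, [])[:per_rel]:
            lines.append(f"Sentence: {ex_sent}\nRelation: {rel}")
    lines.append(f"Sentence: {sentence}\nRelation:")  # query goes last
    return "\n\n".join(lines)
```

Restricting the exemplars to the top-k candidate relations is what lets the noisy bag-level model narrow the label space while the LLM makes the final sentence-level call.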
https://arxiv.org/abs/2510.18344
Knowledge graphs (KGs) are vital for knowledge-intensive tasks and have shown promise in reducing hallucinations in large language models (LLMs). However, constructing high-quality KGs remains difficult, requiring accurate information extraction and structured representations that support interpretability and downstream utility. Existing LLM-based approaches often focus narrowly on entity and relation extraction, limiting coverage to sentence-level contexts or relying on predefined schemas. We propose a hierarchical extraction framework that organizes information at multiple levels, enabling the creation of semantically rich and well-structured KGs. Using state-of-the-art LLMs, we extract and construct knowledge graphs and evaluate them comprehensively from both structural and semantic perspectives. Our results highlight the strengths and shortcomings of current LLMs in KG construction and identify key challenges for future work. To advance research in this area, we also release a curated dataset of LLM-generated KGs derived from research papers on children's mental well-being. This resource aims to foster more transparent, reliable, and impactful applications in high-stakes domains such as healthcare.
https://arxiv.org/abs/2510.11297