Relation extraction is a fundamental step in knowledge graph construction, among other applications. Large language models (LLMs) have been adopted as a promising tool for relation extraction, both in supervised and in-context learning settings. However, in this work we show that their performance still lags behind that of much smaller architectures when the linguistic graph underlying a text is highly complex. To demonstrate this, we evaluate four LLMs against a graph-based parser on six relation extraction datasets with sentence graphs of varying sizes and complexities. Our results show that the graph-based parser increasingly outperforms the LLMs as the number of relations in the input documents increases. This makes the much lighter graph-based parser a superior choice in the presence of complex linguistic graphs.
https://arxiv.org/abs/2604.08752
Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of ``\textit{Small Language Model (SLM) + Classifier}''. However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based \underline{H}ierarchical \underline{C}lassification model for cross-document \underline{RE} (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a \textit{hierarchical relation tree} derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that the LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a \textit{prediction-then-verification} inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
https://arxiv.org/abs/2604.07937
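The level-by-level inference over a hierarchical relation tree described in the HCRE abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tree contents, the relation names, and the `ask_llm` interface are invented for the example.

```python
# Hypothetical sketch of level-by-level relation classification over a
# hierarchical relation tree. At each level the model chooses among a
# node's few children instead of the whole predefined relation set.

def classify_hierarchically(context, tree, ask_llm):
    """Walk the relation tree from the root down to a leaf relation,
    querying the model once per level with only the child options."""
    node = "ROOT"
    while tree.get(node):                  # stop once we reach a leaf
        children = tree[node]
        node = ask_llm(context, children)  # pick one of few candidates
    return node

# Toy tree: predefined relations grouped under coarse parent categories.
tree = {
    "ROOT": ["personal", "organizational"],
    "personal": ["spouse", "sibling"],
    "organizational": ["employer", "founder"],
}

# A stand-in "LLM" that always picks the first offered option.
pick_first = lambda ctx, options: options[0]
print(classify_hierarchically("...", tree, pick_first))  # -> spouse
```

A verification step could wrap `ask_llm` to re-check the chosen child from multiple views before descending, which is where the paper's prediction-then-verification strategy would slot in.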
Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5-1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1-6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates the transferability of the shared prompt representation.
https://arxiv.org/abs/2604.06650
Knowledge Graph construction from natural language requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate whether the decomposition of text into atomic propositions (minimal, semantically autonomous units of information) can improve triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages, trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers entity recall losses while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.
https://arxiv.org/abs/2604.02866
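The fallback combination strategy mentioned in the abstract above can be sketched roughly as: extract triplets from the atomic propositions first, then fall back to sentence-level extraction for head entities the proposition pass missed. The `extract` interface, the (head, relation, tail) triplet format, and the toy examples are assumptions for this sketch, not the paper's actual method.

```python
# Illustrative sketch of combining proposition-level and sentence-level
# extraction with a fallback for lost entities.

def combine_with_fallback(sentence, propositions, extract):
    prop_triplets = [t for p in propositions for t in extract(p)]
    covered_heads = {head for (head, _, _) in prop_triplets}
    # Recover entities lost by the decomposition while keeping the
    # relation-recall gains of the proposition-level pass.
    fallback = [t for t in extract(sentence) if t[0] not in covered_heads]
    return prop_triplets + fallback

# Toy deterministic "extractor" standing in for GLiREL or an LLM.
gold = {
    "Curie, born in Warsaw, won the Nobel Prize.": [
        ("Curie", "won", "Nobel Prize"),
        ("Nobel Prize", "awarded_in", "1903"),  # only found at sentence level
    ],
    "Curie was born in Warsaw.": [("Curie", "born_in", "Warsaw")],
    "Curie won the Nobel Prize.": [("Curie", "won", "Nobel Prize")],
}
extract = lambda text: gold.get(text, [])

merged = combine_with_fallback(
    "Curie, born in Warsaw, won the Nobel Prize.",
    ["Curie was born in Warsaw.", "Curie won the Nobel Prize."],
    extract,
)
print(len(merged))  # -> 3
```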
Temporal Relation Extraction (TRE) requires identifying how two events or temporal expressions are related in time. Existing attention-based models often highlight globally salient tokens but overlook the pair-specific cues that actually determine the temporal relation. We propose WISTERIA (Weak Implicit Signal-based Temporal Relation Extraction with Attention), a framework that examines whether the top-K attention components conditioned on each event pair truly encode interpretable evidence for temporal classification. Unlike prior work that assumes explicit markers such as before, after, or when, WISTERIA considers signals to be any lexical, syntactic, or morphological element that implicitly expresses temporal order. By combining multi-head attention with pair-conditioned top-K pooling, the model isolates the most informative contextual tokens for each pair. We conduct extensive experiments on TimeBank-Dense, MATRES, TDDMan, and TDDAuto, including linguistic analyses of top-K tokens. Results show that WISTERIA achieves competitive accuracy and reveals pair-level rationales aligned with temporal linguistic cues, offering a localized and interpretable view of temporal reasoning.
https://arxiv.org/abs/2603.23319
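The pair-conditioned top-K pooling described in the WISTERIA abstract above reduces, at its core, to keeping the K context tokens with the highest attention for a given event pair. A minimal sketch, with invented tokens and attention scores (the real model operates on multi-head attention tensors, not flat lists):

```python
# Minimal sketch of pair-conditioned top-K token selection: given
# attention scores conditioned on one event pair, keep only the K most
# attended context tokens as that pair's candidate temporal signals.

def topk_signal_tokens(tokens, pair_attention, k=2):
    """Return the k tokens with the highest pair-conditioned attention."""
    ranked = sorted(zip(pair_attention, tokens), reverse=True)
    return [tok for _, tok in ranked[:k]]

tokens = ["the", "meeting", "ended", "before", "lunch", "began"]
attn   = [0.01,  0.05,      0.10,    0.55,     0.09,    0.20]
print(topk_signal_tokens(tokens, attn))  # -> ['before', 'began']
```

The selected tokens are what the paper's linguistic analyses would inspect: here the explicit marker "before" surfaces, but the same mechanism would also surface purely implicit morphological cues.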
Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (this https URL), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.
https://arxiv.org/abs/2603.23136
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains $15,000$ examples built by translating $750$ curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: this https URL
https://arxiv.org/abs/2603.19017
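The intuition behind a date-fragmentation measure like the mDFR mentioned above can be illustrated with a drastically simplified version: count how many subword tokens a tokenizer spends per Year/Month/Day field. The paper's actual mDFR is calibrated with human severity ratings and its exact definition may differ; the toy tokenizer below is purely for illustration.

```python
# Simplified, uncalibrated illustration of a date-fragmentation measure:
# average token count per date field. A value of 1.0 means every field
# survives as a single token; higher values mean fragmentation.

def fragmentation_ratio(date_fields, tokenize):
    total = sum(len(tokenize(field)) for field in date_fields)
    return total / len(date_fields)

# Toy character-pair "tokenizer" standing in for a real subword model.
toy_tokenize = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]

# "2024" splits into two tokens, "03" and "15" stay whole: ratio 4/3.
print(round(fragmentation_ratio(["2024", "03", "15"], toy_tokenize), 3))  # -> 1.333
```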
Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
https://arxiv.org/abs/2603.16002
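The confidence-based selective automation described in the RadAnnotate abstract above amounts to routing each prediction by an entity-specific threshold. A minimal sketch; the threshold values, entity categories, and prediction format are illustrative assumptions, not the paper's numbers:

```python
# Sketch of confidence-based selective automation: predictions above an
# entity-specific threshold are auto-accepted, the rest are routed to
# expert review.

def triage(predictions, thresholds):
    """Split (entity_type, label, confidence) predictions into
    auto-annotated and expert-review queues."""
    auto, review = [], []
    for entity_type, label, confidence in predictions:
        # Unknown entity types default to a threshold of 1.0,
        # i.e. they are always routed to an expert.
        if confidence >= thresholds.get(entity_type, 1.0):
            auto.append((entity_type, label))
        else:
            review.append((entity_type, label))
    return auto, review

# Hypothetical thresholds: uncertain observations, the hardest category
# in the abstract above, get a stricter threshold.
thresholds = {"anatomy": 0.80, "observation_uncertain": 0.95}
preds = [("anatomy", "left lung", 0.91),
         ("observation_uncertain", "possible nodule", 0.88)]
auto, review = triage(preds, thresholds)
print(len(auto), len(review))  # -> 1 1
```

Learning the per-entity thresholds from held-out validation data is what would let such a system trade off automation rate against entity match score, as the abstract reports.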
Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks. Human expert assessment and automatic reasoning metrics further indicate that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at this https URL
https://arxiv.org/abs/2603.08166
Clinical information extraction (e.g., the 2010 i2b2/VA challenge) usually presents tasks of concept recognition, assertion classification, and relation extraction. Jointly modeling these multi-stage tasks in the clinical domain is an underexplored topic. The existing independent task setting (reference inputs given in each stage) makes joint models not directly comparable to existing pipeline work. To address these issues, we define a joint task setting and propose a novel end-to-end system to jointly optimize the three-stage tasks. We empirically investigate the joint evaluation of our proposal and the pipeline baseline with various embedding techniques: word, contextual, and in-domain contextual embeddings. The proposed joint system substantially outperforms the pipeline baseline by +0.3, +1.4, and +3.1 F1 points for concept, assertion, and relation extraction, respectively. This work bridges joint approaches and clinical information extraction. The proposed approach could serve as a strong joint baseline for future research. The code is publicly available.
https://arxiv.org/abs/2603.07487
Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model's predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
https://arxiv.org/abs/2603.06348
Zero-shot relation extraction aims to identify relations between entity mentions using textual descriptions of novel (i.e., previously unseen) relation types instead of labeled training examples. Previous works often rely on unrealistic assumptions: (1) pairs of mentions are often encoded directly in the input, which prevents offline pre-computation for large-scale document database querying; (2) no rejection mechanism is introduced, biasing the evaluation when using these models in a retrieval scenario where some (and often most) inputs are irrelevant and must be ignored. In this work, we study the robustness of existing zero-shot relation extraction models when adapting them to a realistic extraction scenario. To this end, we introduce a typology of existing models, and propose several strategies to build single-pass models and models with a rejection mechanism. We adapt several state-of-the-art tools and compare them in this challenging setting, showing that no existing approach is fully robust under realistic assumptions, though AlignRE (Li et al., 2024) performs best overall across all criteria.
https://arxiv.org/abs/2603.01266
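The rejection mechanism discussed in the abstract above can be sketched in a retrieval-style setting: mention-pair embeddings are pre-computed offline, and a pair is labeled "no relation" when its best similarity to any relation description falls below a threshold. The vectors, relation names, and threshold below are toy values, and a dot product stands in for whatever similarity the adapted models actually use.

```python
# Sketch of threshold-based rejection for retrieval-style zero-shot RE.

def best_relation_or_reject(pair_vec, relation_vecs, threshold=0.5):
    """Return the best-matching relation name, or None (rejected) when
    even the best similarity is below the threshold."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    best = max(relation_vecs, key=lambda name: dot(pair_vec, relation_vecs[name]))
    score = dot(pair_vec, relation_vecs[best])
    return best if score >= threshold else None

# Toy 2-d "embeddings" of two relation descriptions.
relations = {"founded_by": [1.0, 0.0], "located_in": [0.0, 1.0]}
print(best_relation_or_reject([0.9, 0.1], relations))  # -> founded_by
print(best_relation_or_reject([0.1, 0.2], relations))  # -> None
```

Because the pair embedding is computed independently of the relation descriptions, the pair side can be indexed offline, which is exactly the single-pass property the abstract argues for.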
This paper introduces FRAME (Fine-grained Recognition of Art-historical Metadata and Entities), a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases, then filtered to ensure that each text focuses on a single artwork and contains explicit statements about its material, composition, or iconography. FRAME provides stand-off annotations in three layers: a metadata layer for object-level properties, a content layer for depicted subjects and motifs, and a co-reference layer linking repeated mentions. Across layers, entity spans are labeled with 37 types and connected by typed RE links between mentions. Entity types are aligned with Wikidata to support Named Entity Linking (NEL) and downstream knowledge-graph construction. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and bibliographic metadata, and can be used to benchmark and fine-tune NER and RE systems, including zero- and few-shot setups with Large Language Models (LLMs).
https://arxiv.org/abs/2602.19133
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person-place associations in multiple languages and time periods. Systems are asked to classify two relation types, $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?"), requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
https://arxiv.org/abs/2602.17663
Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.
https://arxiv.org/abs/2602.17475
We introduce CogRE, a novel framework for relation extraction (RE) that enhances both accuracy and explainability. The framework has two key components: (i) a reasoning mechanism that formulates relation extraction as a series of text-processing steps inspired by cognitive science, and (ii) an optimization process driven by a novel reinforcement learning (RL) reward function. Our framework introduces relation keywords and rewards the generation of such keywords using an automatically constructed keyword dictionary. This design addresses the lack of language-based explanations in traditional RE and provides supervision for explanation during RL training. Our experiments show that CogRE improves explanation quality by addressing two common failure patterns in one-shot RE: poor attention focus and limited one-shot learning capability. For example, our cognitive-structured reasoning with Qwen2.5-15B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using our reward further improves performance by +23.46% (absolute). Further, models trained on NYT29 with our reward achieve a +16.9% F1 gain on out-of-distribution WIKIDATA. Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
https://arxiv.org/abs/2510.06198
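A keyword-based reward in the spirit of the CogRE abstract above could score a generated explanation by its overlap with the relation's dictionary keywords. This is an illustrative sketch only: the dictionary contents, relation identifier, and the simple overlap scoring are invented, not the paper's actual reward function.

```python
# Illustrative keyword-overlap reward for an RL training loop: fraction
# of the relation's dictionary keywords appearing in the explanation.

def keyword_reward(explanation, relation, keyword_dict):
    """Return the hit rate of dictionary keywords in the explanation
    (0.0 when the relation has no keywords)."""
    keywords = keyword_dict.get(relation, set())
    if not keywords:
        return 0.0
    hits = {k for k in keywords if k in explanation.lower()}
    return len(hits) / len(keywords)

# Hypothetical dictionary entry for one relation type.
kw = {"/people/person/place_of_birth": {"born", "birthplace", "native"}}
r = keyword_reward("She was born in Warsaw.", "/people/person/place_of_birth", kw)
print(round(r, 2))  # -> 0.33
```

In an RL setup such a term would be combined with a correctness reward on the predicted relation label, so the model is pushed toward explanations grounded in relational keywords rather than free-form text.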
Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images, which requires grounding event semantics across modalities. Progress in MEE is limited by the lack of annotated training data: M2E2 is the only established benchmark, but it provides annotations only for evaluation, making direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.
https://arxiv.org/abs/2602.13748
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
https://arxiv.org/abs/2602.12203
Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
https://arxiv.org/abs/2602.09163
With the continuous progress of digitization in Chinese judicial institutions, a substantial amount of electronic legal document information has been accumulated. To unlock its potential value, entity and relation extraction for legal documents has emerged as a crucial task. However, existing methods often lack domain-specific knowledge and fail to account for the unique characteristics of the judicial domain. In this paper, we propose an entity and relation extraction algorithm based on hypergraph neural network (Legal-KAHRE) for drug-related judgment documents. Firstly, we design a candidate span generator based on neighbor-oriented packing strategy and biaffine mechanism, which identifies spans likely to contain entities. Secondly, we construct a legal dictionary with judicial domain knowledge and integrate it into text encoding representation using multi-head attention. Additionally, we incorporate domain-specific cases like joint crimes and combined punishment for multiple crimes into the hypergraph structure design. Finally, we employ a hypergraph neural network for higher-order inference via message passing. Experimental results on the CAIL2022 information extraction dataset demonstrate that our method significantly outperforms existing baseline models.
https://arxiv.org/abs/2602.08289