Towards Scalable and Cross-Lingual Specialist Language Models for Oncology

2025-03-11 11:34:57
Morteza Rohanian, Tarun Mehra, Nicola Miglino, Farhad Nooralahzadeh, Michael Krauthammer, Andreas Wicki

Abstract

Clinical oncology generates vast, unstructured data that often contain inconsistencies, missing information, and ambiguities, making it difficult to extract reliable insights for data-driven decision-making. General-purpose large language models (LLMs) struggle with these challenges due to their lack of domain-specific reasoning, including specialized clinical terminology, context-dependent interpretations, and multi-modal data integration. We address these issues with an oncology-specialized, efficient, and adaptable NLP framework that combines instruction tuning, retrieval-augmented generation (RAG), and graph-based knowledge integration. Our lightweight models prove effective at oncology-specific tasks, such as named entity recognition (e.g., identifying cancer diagnoses), entity linking (e.g., linking entities to standardized ontologies), TNM staging, document classification (e.g., cancer subtype classification from pathology reports), and treatment response prediction. Our framework emphasizes adaptability and resource efficiency. We include minimal German instructions, collected at the University Hospital Zurich (USZ), to test whether small amounts of non-English language data can effectively transfer knowledge across languages. This approach mirrors our motivation for lightweight models, which balance strong performance with reduced computational costs, making them suitable for resource-limited healthcare settings. We validated our models on oncology datasets, demonstrating strong results in named entity recognition, relation extraction, and document classification.
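
The abstract names retrieval-augmented generation as one component of the framework but gives no implementation details. As a rough illustration of the general idea only, and not the authors' method, the following minimal sketch ranks a few knowledge snippets by token overlap with a report and prepends the best matches to an instruction prompt. The function names, the Jaccard scoring, the prompt layout, and the toy snippets are all assumptions made for this example.

```python
import re

# Toy knowledge snippets, invented for illustration (not from the paper).
KNOWLEDGE = [
    "T describes the size and extent of the primary tumor.",
    "N0 means no regional lymph node metastasis.",
    "M0 means no distant metastasis.",
    "Hospital visiting hours are listed on the ward notice board.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, snippets, k=2):
    """Return the k snippets with the highest Jaccard overlap with the query."""
    query_tokens = tokenize(query)

    def score(snippet):
        snippet_tokens = tokenize(snippet)
        union = query_tokens | snippet_tokens
        return len(query_tokens & snippet_tokens) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original order.
    return sorted(snippets, key=score, reverse=True)[:k]


def build_prompt(instruction, report, snippets, k=2):
    """Assemble an instruction prompt with retrieved context (hypothetical layout)."""
    context = retrieve(report, snippets, k)
    lines = ["Instruction: " + instruction, "Context:"]
    lines += ["- " + snippet for snippet in context]
    lines += ["Report: " + report, "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    report = "Staging pT2 N0 M0, no distant metastasis found."
    print(build_prompt("Extract the TNM stage.", report, KNOWLEDGE))
```

A real system would likely replace the token-overlap scorer with dense embeddings or a knowledge-graph lookup and pass the prompt to an instruction-tuned model; the sketch only shows where retrieved context enters the prompt.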

URL

https://arxiv.org/abs/2503.08323

PDF

https://arxiv.org/pdf/2503.08323.pdf

