Paper Reading AI Learner

Advancing Chronic Tuberculosis Diagnostics Using Vision-Language Models: A Multi modal Framework for Precision Analysis

2025-03-17 13:49:29
Praveen Shastry, Sowmya Chowdary Muthulur, Naveen Kumarasami, Anandakumar D, Mounigasri M, Keerthana R, Kishore Prasath Venkatesh, Bargava Subramanian, Kalyan Sivasailam, Revathi Ezhumalai, Abitha Marimuthu

Abstract

Background This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods The VLM architecture combines a Vision Transformer (ViT) for visual encoding and a transformer-based text encoder to process clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned using 100,000 chronic TB-specific chest X-rays. Results The model demonstrated high precision (94 percent) and recall (94 percent) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable and context-aware insights. Future work will address subtle pathologies and dataset biases to enhance the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.

Abstract (translated)

背景:这项研究提出了一种利用SIGLIP编码器和Gemma-3b解码器的视觉语言模型(VLM),以增强自动慢性结核病(TB)筛查。通过整合胸部X光影像与临床数据,该模型旨在解决手动解读的挑战,提高诊断的一致性和可访问性,特别是在资源有限的情况下。 方法:VLM架构结合了用于视觉编码的Vision Transformer(ViT)和基于变压器的文本编码器以处理如患者病史和治疗记录等临床背景信息。跨模态注意机制将放射学特征与文本信息对齐,而Gemma-3b解码器则生成全面的诊断报告。该模型在500万张配对的医学影像和文本数据上进行了预训练,并通过10万张特定于慢性TB的胸部X光片进行了微调。 结果:该模型对于检测慢性结核病的关键病理特征,如纤维化、钙化的肉芽肿和支气管扩张等表现出了高精度(94%)和召回率(94%)。在曲线下面积(AUC)评分超过0.93,交并比(IoU)值高于0.91,验证了其在检测和定位与结核病相关的异常方面具有有效性。 结论:VLM为自动慢性结核诊断提供了一种稳健且可扩展的解决方案,通过整合放射学和临床数据来交付实际可行且情境感知的洞察。未来的工作将解决细微病理特征和数据集偏差的问题,以增强模型在多样人口和医疗环境中的泛化能力,确保其性能的公平性。

URL

https://arxiv.org/abs/2503.14536

PDF

https://arxiv.org/pdf/2503.14536.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot