Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment

2026-02-03 17:03:46
Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi

Abstract

Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
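
The abstract mentions post-training quantization as the route to smaller models and lower latency but does not say which scheme is used. As a rough illustration only, the sketch below shows symmetric per-tensor int8 quantization in plain Python. The function names (`quantize_int8`, `dequantize_int8`, `max_roundtrip_error`) and the choice of int8 are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization (illustrative assumption, not the paper's method).

    Maps each float to an integer in [-127, 127] using a single scale factor.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


def max_roundtrip_error(weights: List[float]) -> float:
    """Largest absolute difference between original and dequantized weights."""
    quantized, scale = quantize_int8(weights)
    restored = dequantize_int8(quantized, scale)
    return max((abs(a - b) for a, b in zip(weights, restored)), default=0.0)


if __name__ == "__main__":
    example = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights
    codes, s = quantize_int8(example)
    print("int8 codes:", codes)
    print("scale:", s)
    print("max round-trip error:", max_roundtrip_error(example))
```

Under this scheme each weight is stored in one byte instead of the four bytes of a float32 value, and the round-trip error per weight is bounded by about half the scale factor. Hardware-specific toolchains typically add calibration and operator fusion on top of this basic idea.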

URL

https://arxiv.org/abs/2602.03742

PDF

https://arxiv.org/pdf/2602.03742.pdf

