Paper Reading AI Learner

Unsupervised semantic segmentation of high-resolution UAV imagery for road scene parsing


Abstract

Parsing road scenes in UAV images presents two challenges. First, the high resolution of UAV images makes processing difficult. Second, supervised deep learning methods require a large amount of manual annotation to train robust and accurate models. In this paper, an unsupervised road parsing framework is introduced that leverages recent advances in vision language models and vision foundation models. First, a vision language model efficiently processes ultra-high-resolution UAV images to quickly detect road regions of interest. Next, the vision foundation model SAM generates masks for the road regions without category information. A self-supervised representation learning network then extracts feature representations from all masked regions. Finally, an unsupervised clustering algorithm clusters these feature representations and assigns an ID to each cluster. The masked regions are combined with the corresponding IDs to generate initial pseudo-labels, which initiate an iterative self-training process for regular semantic segmentation. The proposed method achieves an impressive 89.96% mIoU on the development dataset without relying on any manual annotation. Particularly noteworthy is the flexibility of the proposed method, which goes beyond the limitations of human-defined categories and is able to acquire knowledge of new categories from the dataset itself.
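
The clustering and pseudo-label steps described in the abstract can be illustrated with a minimal sketch. It assumes each mask is a boolean array and that one feature vector per mask is already available; the toy features stand in for the output of the self-supervised network. The abstract does not name the clustering algorithm, so plain k-means with a deterministic initialisation is used here purely for illustration, and the names `kmeans`, `build_pseudo_label`, and `ignore_index` are hypothetical, not taken from the paper.

```python
import numpy as np


def kmeans(features, k, n_iter=20):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    Returns one cluster ID per feature row.
    """
    # Deterministic farthest-point initialisation: start from row 0, then
    # repeatedly add the row farthest from the centroids chosen so far.
    centroids = [features[0]]
    for _ in range(1, k):
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(features[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Squared Euclidean distance from every feature to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        ids = dists.argmin(axis=1)
        for c in range(k):
            if np.any(ids == c):  # keep the old centroid if a cluster empties
                centroids[c] = features[ids == c].mean(axis=0)
    return ids


def build_pseudo_label(masks, ids, shape, ignore_index=255):
    """Paint each class-agnostic mask with its cluster ID.

    Pixels not covered by any mask keep `ignore_index` (a hypothetical
    convention so that a later training loss could skip them).
    """
    label = np.full(shape, ignore_index, dtype=np.uint8)
    for mask, cid in zip(masks, ids):
        label[mask] = cid
    return label


# Toy example: four masks, each covering one column of a 2x5 image.
masks = [np.zeros((2, 5), dtype=bool) for _ in range(4)]
for i, m in enumerate(masks):
    m[:, i] = True

# Stand-in feature vectors: two visually similar pairs of regions.
features = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])

cluster_ids = kmeans(features, k=2)
pseudo_label = build_pseudo_label(masks, cluster_ids, shape=(2, 5))
print(pseudo_label)
```

In this sketch the resulting label map would serve as the initial training target for a segmentation network; the iterative self-training loop that refines it is not shown.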

URL

https://arxiv.org/abs/2402.02985

PDF

https://arxiv.org/pdf/2402.02985.pdf

