Paper Reading AI Learner

MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network

2026-01-20 08:19:17
Yiwei Lu, Hao Huang, Tao Yan

Abstract

Glass surface ubiquitous in both daily life and professional environments presents a potential threat to vision-based systems, such as robot and drone navigation. To solve this challenge, most recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features and estimated optical flow maps, the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhances temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for learning our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.

Abstract (translated)

玻璃表面在日常生活中和专业环境中普遍存在,这给基于视觉的系统(如机器人和无人机导航)带来了潜在威胁。为了解决这一挑战,最近的研究对视频玻璃面检测(VGSD)表现出了浓厚的兴趣。我们观察到,在反射层或透射层中的物体似乎距离玻璃更远。因此,在视频运动场景中,相较于同一平面内的非玻璃区域里的对象,玻璃表面上的显著反射(或透射)物体移动得较慢,这种运动不一致性可以有效地揭示玻璃表面的存在。 基于这一观察,我们提出了一种名为MVGD-Net的新网络,用于通过利用运动不一致线索来检测视频中的玻璃面。我们的MVGD-Net具有三个新颖模块:跨尺度多模态融合模块(CMFM),该模块整合了提取的空间特征和估计的光流图;历史引导注意模块(HGAM)以及时间交叉注意模块(TCAM),这两个模块进一步增强了时序特征。此外,还引入了一个时空解码器(TSD),用于融合空间和时间特征以生成玻璃区域掩模。 为了训练我们的网络,我们还提出了一套大规模的数据集,其中包括312种多样的玻璃场景,总计有19,268帧。广泛的实验表明,与相关最先进的方法相比,我们的MVGD-Net在性能上取得了优越的结果。

URL

https://arxiv.org/abs/2601.13715

PDF

https://arxiv.org/pdf/2601.13715.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot