Paper Reading AI Learner

In Defense of Clip-based Video Relation Detection

2023-07-18 05:42:01
Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

Abstract

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
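To make the bottom-up paradigm concrete, the following is a minimal sketch of the "classify per clip, then merge" step the abstract describes: relation triplets predicted on short clips are greedily chained into long video-level relations when adjacent clips share the same triplet and their tubelets overlap at the clip boundary. All names, the data layout, and the IoU-based association rule are illustrative assumptions, not the paper's actual algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class ClipRelation:
    """A relation triplet detected on one short clip (hypothetical layout)."""
    triplet: Tuple[str, str, str]   # (subject, predicate, object)
    clip_range: Tuple[int, int]     # [start_frame, end_frame)
    subj_start: Box                 # subject box in the clip's first frame
    subj_end: Box                   # subject box in the clip's last frame
    obj_start: Box
    obj_end: Box


def merge_clip_relations(rels: List[ClipRelation],
                         iou_thr: float = 0.5) -> List[ClipRelation]:
    """Greedily chain clip relations with identical triplets whose clips are
    temporally adjacent and whose tubelets overlap at the boundary frame."""
    merged: List[ClipRelation] = []
    for r in sorted(rels, key=lambda x: x.clip_range[0]):
        for m in merged:
            if (m.triplet == r.triplet
                    and m.clip_range[1] == r.clip_range[0]
                    and iou(m.subj_end, r.subj_start) >= iou_thr
                    and iou(m.obj_end, r.obj_start) >= iou_thr):
                # Extend the video-level relation to cover this clip too.
                m.clip_range = (m.clip_range[0], r.clip_range[1])
                m.subj_end, m.obj_end = r.subj_end, r.obj_end
                break
        else:
            # No chain to extend: start a new video-level relation.
            merged.append(ClipRelation(r.triplet, r.clip_range,
                                       r.subj_start, r.subj_end,
                                       r.obj_start, r.obj_end))
    return merged
```

The greedy boundary-IoU rule is one simple way to avoid long-term tracking: each association decision only looks one clip ahead, which is the flexibility the abstract attributes to clip tubelets.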

URL

https://arxiv.org/abs/2307.08984

PDF

https://arxiv.org/pdf/2307.08984.pdf

