Paper Reading AI Learner

Unbiased Scene Graph Generation in Videos

2023-04-03 06:10:06
Sayak Nag, Kyle Min, Subarna Tripathi, Amit K. Roy-Chowdhury

Abstract

The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuations in model predictions, and the long-tailed distribution of visual relationships, in addition to the challenges already present in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant performance gains (up to 10% in some cases) over existing methods, highlighting its superiority in generating more unbiased scene graphs.
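
As one way to picture the GMM-based uncertainty attenuation the abstract mentions, the sketch below shows a generic mixture-density prediction head for predicate classification in PyTorch. This is not the authors' implementation: the class name `GMMPredicateHead`, the number of components `k`, the feature dimension, and the variance-based loss weighting are all illustrative assumptions.

```python
# A minimal sketch, NOT the TEMPURA implementation: a Gaussian Mixture Model
# (mixture-density) head that predicts predicate-class scores together with a
# per-class variance that can serve as predictive uncertainty. All names and
# hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMPredicateHead(nn.Module):
    """Outputs a K-component Gaussian mixture over predicate-class scores."""

    def __init__(self, in_dim: int, num_classes: int, k: int = 4):
        super().__init__()
        self.k = k
        self.num_classes = num_classes
        self.pi = nn.Linear(in_dim, k)                    # mixture weights
        self.mu = nn.Linear(in_dim, k * num_classes)      # component means
        self.sigma = nn.Linear(in_dim, k * num_classes)   # component variances

    def forward(self, feats: torch.Tensor):
        b = feats.size(0)
        pi = F.softmax(self.pi(feats), dim=-1)                           # (B, K)
        mu = self.mu(feats).view(b, self.k, self.num_classes)            # (B, K, C)
        var = F.softplus(self.sigma(feats)).view(b, self.k, self.num_classes)
        # Mixture mean gives the class scores; the law of total variance
        # gives a per-class predictive uncertainty.
        mean = (pi.unsqueeze(-1) * mu).sum(dim=1)                        # (B, C)
        total_var = (pi.unsqueeze(-1) * (var + mu ** 2)).sum(dim=1) - mean ** 2
        return mean, total_var


if __name__ == "__main__":
    # Toy usage: down-weight the loss of relations whose predicted variance is
    # high -- one simple way to "attenuate" uncertain predictions.
    head = GMMPredicateHead(in_dim=512, num_classes=26, k=4)
    pair_feats = torch.randn(8, 512)                  # 8 subject-object pairs
    scores, uncertainty = head(pair_feats)
    targets = torch.randint(0, 26, (8,))
    ce = F.cross_entropy(scores, targets, reduction="none")
    w = 1.0 / (1.0 + uncertainty.gather(1, targets[:, None]).squeeze(1))
    loss = (w * ce).mean()
    print(scores.shape, uncertainty.shape, float(loss))
```

The point of the sketch is only that a mixture head yields both a score and a variance per relationship class, and the variance can be used to discount unreliable predictions during training.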

URL

https://arxiv.org/abs/2304.00733

PDF

https://arxiv.org/pdf/2304.00733.pdf
