Paper Reading AI Learner

Correlation Debiasing for Unbiased Scene Graph Generation in Videos

2023-10-24 14:59:51
Anant Khandelwal

Abstract

Dynamic scene graph generation (SGG) from videos requires not only a comprehensive understanding of objects across scenes that are prone to temporal fluctuations, but also modeling of the temporal motions and interactions among different objects. Moreover, the long-tailed distribution of visual relationships is a crucial bottleneck for most dynamic SGG methods, since most of them focus on capturing spatio-temporal context with complex architectures, which leads to the generation of biased scene graphs. To address these challenges, we propose FloCoDe: Flow-aware temporal consistency and Correlation Debiasing with uncertainty attenuation for unbiased dynamic scene graphs. FloCoDe employs flow-based feature warping to detect temporally consistent objects across frames. In addition, it uses correlation debiasing to learn unbiased relation representations for long-tailed classes. Moreover, to attenuate predictive uncertainty, it uses a mixture of sigmoidal cross-entropy loss and contrastive loss that incorporates label correlations, identifying commonly co-occurring relations and helping to debias the long-tailed ones. Extensive experimental evaluation shows a performance gain of up to 4.1%, demonstrating the superiority of FloCoDe in generating more unbiased scene graphs.
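To make the two ingredients named in the abstract more concrete, below is a minimal PyTorch sketch of (a) flow-guided feature warping for temporal consistency and (b) a loss that mixes sigmoidal (binary) cross-entropy with a supervised contrastive term over relation embeddings. This is an illustrative sketch of the general techniques only, not the paper's actual implementation; all function names, tensor shapes, and the `pos_mask` / `alpha` / `temperature` parameters are assumptions introduced here.

```python
# Illustrative sketch (not the paper's code): flow-based feature warping and a
# sigmoidal cross-entropy + contrastive loss mixture for relation classification.
import torch
import torch.nn.functional as F


def warp_features(feat_prev, flow):
    """Warp frame t-1 features to frame t with a dense flow field.

    feat_prev: (B, C, H, W) features of the previous frame
    flow:      (B, 2, H, W) flow from frame t to frame t-1, in pixels (assumed convention)
    """
    B, _, H, W = feat_prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=flow.device, dtype=flow.dtype),
        torch.arange(W, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    # Shift the sampling grid by the flow and normalize to [-1, 1] for grid_sample.
    grid_x = (xs.unsqueeze(0) + flow[:, 0]) / (W - 1) * 2 - 1
    grid_y = (ys.unsqueeze(0) + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat_prev, grid, align_corners=True)


def mixed_relation_loss(logits, targets, embeddings, pos_mask,
                        temperature=0.1, alpha=0.5):
    """Sigmoidal cross-entropy on multi-label relation logits plus a simple
    supervised contrastive term that pulls together samples sharing a label.

    logits:     (N, R) relation logits;  targets: (N, R) multi-hot labels
    embeddings: (N, D) relation features
    pos_mask:   (N, N) 1 where two samples share at least one relation label
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets.float())

    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                 # pairwise similarities (N, N)
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_mask = pos_mask.clone().float()
    pos_mask.fill_diagonal_(0)                    # self-pairs are not positives
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    contrastive = -(pos_mask * log_prob).sum(dim=1) / pos_counts

    return bce + alpha * contrastive.mean()
```

In this sketch the label co-occurrence information enters only through `pos_mask` (samples with shared relation labels are treated as positives), which is one simple way to encode label correlations in a contrastive objective.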

URL

https://arxiv.org/abs/2310.16073

PDF

https://arxiv.org/pdf/2310.16073.pdf

