Paper Reading AI Learner

Video Relationship Detection Using Mixture of Experts

2024-03-06 19:08:34
Ala Shaabana, Zahra Gharaee, Paul Fieguth

Abstract

Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. First, there is a computational and inferential gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts upon and to represent that relationship in language. Second, classifiers trained as a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection using a mixture of experts. MoE-VRD identifies language triplets in the form of <subject, predicate, object> tuples, extracting relationships from visual processing. Building on recent advances in visual relationship detection, MoE-VRD addresses the need for action recognition when establishing relationships between subjects (acting) and objects (being acted upon). In contrast to a single monolithic network, MoE-VRD employs multiple small models as experts whose outputs are aggregated; each expert specializes in visual relationship learning and object tagging. By using a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and substantially increases neural network capacity without a corresponding increase in computational complexity. Our experimental results demonstrate that the conditional computation and scalability of the mixture-of-experts approach lead to superior visual relationship detection performance compared to state-of-the-art methods.
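
To make the sparsely-gated mixture-of-experts idea concrete, below is a minimal PyTorch-style sketch of top-k gating: only the k highest-scoring experts run per input, so capacity grows with the number of experts while per-sample computation stays roughly constant. This is an illustrative sketch, not the authors' implementation; the expert architecture, feature dimension (512), and output size (132 predicate classes) are assumptions for the example.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts layer (top-k gating).
# NOT the MoE-VRD implementation: expert/gate sizes and the relationship head
# are placeholders chosen only for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, in_dim, out_dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is a small model; a single linear layer here for brevity.
        self.experts = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(num_experts)])
        # The gate scores every expert from the input features.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        scores = self.gate(x)                              # (batch, num_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)  # keep only k experts per sample
        weights = F.softmax(topk_vals, dim=-1)             # normalize over the chosen experts
        out = torch.zeros(x.size(0), self.experts[0].out_features, device=x.device)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                        # which expert handles each sample
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Aggregate expert outputs weighted by the gate.
                    out[mask] += w[mask] * expert(x[mask])
        return out

# Example: gate pooled visual features toward <subject, predicate, object> prediction.
feats = torch.randn(4, 512)               # hypothetical pooled video features
moe = SparseMoE(in_dim=512, out_dim=132)  # 132 = assumed number of predicate classes
predicate_logits = moe(feats)
```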

URL

https://arxiv.org/abs/2403.03994

PDF

https://arxiv.org/pdf/2403.03994.pdf

