Deep Learning Based Multimodal with Two-phase Training Strategy for Daily Life Video Classification

2023-04-30 19:12:34
Lam Pham, Trang Le, Cam Le, Dat Ngo, Weissenfeld Axel, Alexander Schindler

Abstract

In this paper, we present a deep-learning-based multimodal system for classifying daily life videos. To train the system, we propose a two-phase training strategy. In the first training phase (Phase I), we extract the audio and visual (image) data from the original videos and train an independent deep learning model on each modality. After these training processes, we obtain audio embeddings and visual embeddings by extracting feature maps from the pre-trained models. In the second training phase (Phase II), we train a fusion layer that combines the audio and visual embeddings, followed by a dense layer that classifies the combined embedding into the target daily scenes. Our extensive experiments, conducted on the benchmark DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) 2021 Task 1B Development dataset, achieved the best classification accuracies of 80.5%, 91.8%, and 95.3% with audio data only, visual data only, and both audio and visual data, respectively. The highest classification accuracy of 95.3% represents an improvement of 17.9% over the DCASE baseline and is highly competitive with state-of-the-art systems.
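
To make the two-phase pipeline more concrete, below is a minimal PyTorch sketch of the Phase II head. It assumes concatenation as the fusion mechanism and illustrative embedding sizes; the abstract specifies neither, so the class name `PhaseTwoHead`, the dimensions, and the fusion choice are hypothetical. Only `num_classes=10` is grounded, matching the ten scene labels of DCASE 2021 Task 1B.

```python
import torch
import torch.nn as nn

class PhaseTwoHead(nn.Module):
    """Hypothetical Phase II head: fuses pre-computed audio/visual
    embeddings (extracted in Phase I from the pre-trained backbones)
    and classifies the result into daily-scene labels.
    All dimensions are illustrative, not taken from the paper."""

    def __init__(self, audio_dim=512, visual_dim=512,
                 fused_dim=256, num_classes=10):
        super().__init__()
        # Fusion layer: project the concatenated embeddings into a joint space.
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, fused_dim),
            nn.ReLU(),
        )
        # Dense layer: map the fused embedding to the target scene classes.
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, audio_emb, visual_emb):
        # Concatenate the two modality embeddings, fuse, then classify.
        fused = self.fusion(torch.cat([audio_emb, visual_emb], dim=-1))
        return self.classifier(fused)

# Example: a batch of 8 embedding pairs, as Phase I would produce them.
audio_emb = torch.randn(8, 512)
visual_emb = torch.randn(8, 512)
logits = PhaseTwoHead()(audio_emb, visual_emb)
print(logits.shape)  # torch.Size([8, 10])
```

In Phase II, only the fusion and dense layers would be trained; the Phase I models act purely as fixed embedding extractors.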

URL

https://arxiv.org/abs/2305.01476

PDF

https://arxiv.org/pdf/2305.01476.pdf

