Abstract
In this paper, we present a deep-learning-based multimodal system for classifying daily life videos. To train the system, we propose a two-phase training strategy. In the first training phase (Phase I), we extract the audio and visual (image) data from the original video. We then train independent deep learning models on the audio data and the visual data. After training, we obtain audio embeddings and visual embeddings by extracting feature maps from the pre-trained models. In the second training phase (Phase II), we train a fusion layer to combine the audio and visual embeddings and a dense layer to classify the combined embedding into target daily scenes. Our extensive experiments, conducted on the benchmark dataset of DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) 2021 Task 1B Development, achieved the best classification accuracies of 80.5%, 91.8%, and 95.3% with only audio data, only visual data, and both audio and visual data, respectively. The highest classification accuracy of 95.3% represents an improvement of 17.9% over the DCASE baseline and is highly competitive with state-of-the-art systems.
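The Phase II step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the embedding sizes, the hidden width, and the use of concatenation as the fusion operation are all assumptions, since the abstract only states that a fusion layer combines the two embeddings before a dense classification layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (not specified in the abstract).
AUDIO_DIM, VISUAL_DIM, FUSED_DIM, NUM_CLASSES = 128, 256, 64, 10

def fuse_and_classify(audio_emb, visual_emb, w_fuse, w_cls):
    """Phase II sketch: concatenate the pre-trained audio and visual
    embeddings (one plausible fusion choice), pass them through a
    fusion layer, then a dense softmax layer over target scenes."""
    joint = np.concatenate([audio_emb, visual_emb], axis=-1)
    fused = np.tanh(joint @ w_fuse)                # fusion layer
    logits = fused @ w_cls                         # dense layer
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)   # class probabilities

# Toy stand-ins for Phase I embeddings and the trainable weights.
audio_emb = rng.standard_normal((4, AUDIO_DIM))
visual_emb = rng.standard_normal((4, VISUAL_DIM))
w_fuse = rng.standard_normal((AUDIO_DIM + VISUAL_DIM, FUSED_DIM)) * 0.1
w_cls = rng.standard_normal((FUSED_DIM, NUM_CLASSES)) * 0.1

probs = fuse_and_classify(audio_emb, visual_emb, w_fuse, w_cls)
print(probs.shape)  # (4, 10): one probability vector per video clip
```

In the actual system, `w_fuse` and `w_cls` would be learned in Phase II while the Phase I audio and visual backbones stay frozen as embedding extractors.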
URL
https://arxiv.org/abs/2305.01476