
Multimodal feature fusion for CNN-based gait recognition: an empirical comparison

2018-06-19 11:36:22
Francisco Manuel Castro, Manuel Jesús Marín-Jiménez, Nicolás Guil, Nicolás Pérez de la Blanca

Abstract

Identifying people in video by the way they walk (i.e. gait) is a relevant computer vision task that relies on a non-invasive approach. Current standard approaches typically derive gait signatures from sequences of binary energy maps of subjects extracted from images, but this process introduces a large amount of non-stationary noise, which limits their efficacy. In contrast, in this paper we focus on the raw pixels, or simple functions derived from them, and let advanced learning techniques extract the relevant features. We present a comparative study of different Convolutional Neural Network (CNN) architectures on three low-level features (i.e. gray pixels, optical flow channels and depth maps) on two widely adopted and challenging datasets: TUM-GAID and CASIA-B. In addition, we compare different early and late fusion methods used to combine the information obtained from each kind of low-level feature. Our experimental results suggest that (i) hand-crafted energy maps (e.g. GEI) are not necessary, since equal or better results can be achieved from the raw pixels; (ii) combining multiple modalities (i.e. gray pixels, optical flow and depth maps) from different CNNs yields state-of-the-art results on the gait task with an image resolution several times smaller than in previously reported results; and (iii) the choice of architecture is critical and can make the difference between state-of-the-art and poor results.
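To make the distinction between early and late fusion concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which fusion rules were compared, so the product and weighted-sum rules below are assumed as common examples of late fusion, and plain concatenation as an example of early fusion. All function names and the logits are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw per-subject scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion_product(prob_lists):
    """Late fusion (assumed rule): multiply per-modality probabilities, then renormalize."""
    fused = [1.0] * len(prob_lists[0])
    for probs in prob_lists:
        fused = [f * p for f, p in zip(fused, probs)]
    total = sum(fused)
    return [f / total for f in fused]

def late_fusion_weighted_sum(prob_lists, weights):
    """Late fusion (assumed rule): weighted average of per-modality probabilities."""
    weight_total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / weight_total
        for i in range(n_classes)
    ]

def early_fusion_concat(feature_lists):
    """Early fusion (assumed rule): concatenate per-modality feature vectors
    so that a single classifier can be trained on the joint descriptor."""
    return [x for feats in feature_lists for x in feats]

def predict(probs):
    """Index of the most probable subject."""
    return max(range(len(probs)), key=probs.__getitem__)

# Made-up logits over three subjects, one list per modality.
gray = softmax([2.0, 1.0, 0.1])
flow = softmax([0.5, 2.5, 0.3])
depth = softmax([1.0, 1.2, 0.2])

fused_product = late_fusion_product([gray, flow, depth])
fused_sum = late_fusion_weighted_sum([gray, flow, depth], [1.0, 1.0, 1.0])
joint_features = early_fusion_concat([[0.1, 0.2], [0.3], [0.4, 0.5]])
```

In this sketch, late fusion combines the outputs of independently trained per-modality networks, whereas early fusion merges features before classification, so the modalities are learned jointly.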


URL

https://arxiv.org/abs/1806.07753

PDF

https://arxiv.org/pdf/1806.07753.pdf

