Multimodal feature fusion for CNN-based gait recognition: an empirical comparison

Abstract

Identifying people in video by the way they walk (i.e. gait) is a relevant, non-invasive task in computer vision. Standard and current approaches typically derive gait signatures from sequences of binary energy maps of subjects extracted from images, but this process introduces a large amount of non-stationary noise, which limits their efficacy. In contrast, in this paper we focus on the raw pixels, or simple functions derived from them, and let advanced learning techniques extract the relevant features. We therefore present a comparative study of different Convolutional Neural Network (CNN) architectures on three low-level features (i.e. gray pixels, optical flow channels and depth maps) on two widely adopted and challenging datasets: TUM-GAID and CASIA-B. In addition, we compare different early and late fusion methods used to combine the information obtained from each kind of low-level feature. Our experimental results suggest that (i) the use of hand-crafted energy maps (e.g. GEI) is not necessary, since equal or better results can be achieved from the raw pixels; (ii) combining multiple modalities (i.e. gray pixels, optical flow and depth maps) from different CNNs makes it possible to obtain state-of-the-art results on the gait task with an image resolution several times smaller than that used in previously reported results; and (iii) the selection of the architecture is a critical point that can make the difference between state-of-the-art and poor results.
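To make the distinction between early and late fusion concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which fusion rules are compared, so the product and weighted-sum rules below are assumed as common late-fusion choices, and feature concatenation is assumed as a simple form of early fusion. All function names, array shapes and the random "CNN outputs" are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def late_fusion_product(probs_per_modality):
    """Late fusion (assumed rule): multiply class probabilities, then renormalize."""
    fused = np.prod(np.stack(probs_per_modality), axis=0)
    return fused / fused.sum(axis=1, keepdims=True)


def late_fusion_weighted_sum(probs_per_modality, weights):
    """Late fusion (assumed rule): convex combination of class probabilities."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize so the weights sum to 1
    stacked = np.stack(probs_per_modality)     # shape: (modalities, samples, classes)
    return np.tensordot(weights, stacked, axes=1)  # shape: (samples, classes)


def early_fusion_concat(features_per_modality):
    """Early fusion (assumed form): concatenate feature vectors before classification."""
    return np.concatenate(features_per_modality, axis=1)


# Hypothetical stand-ins for the outputs of three single-modality CNNs.
rng = np.random.default_rng(0)
num_samples, num_subjects = 4, 5
gray_probs = softmax(rng.normal(size=(num_samples, num_subjects)))
flow_probs = softmax(rng.normal(size=(num_samples, num_subjects)))
depth_probs = softmax(rng.normal(size=(num_samples, num_subjects)))

fused_product = late_fusion_product([gray_probs, flow_probs, depth_probs])
fused_sum = late_fusion_weighted_sum(
    [gray_probs, flow_probs, depth_probs], weights=[1.0, 2.0, 1.0]
)
predicted_subject = fused_product.argmax(axis=1)  # one identity per sample

# Hypothetical intermediate features of different sizes for early fusion.
gray_feat = rng.normal(size=(num_samples, 8))
flow_feat = rng.normal(size=(num_samples, 16))
joint_feat = early_fusion_concat([gray_feat, flow_feat])  # shape: (4, 24)
```

In late fusion each modality keeps its own network and only the class scores are combined, whereas in early fusion the joint feature vector would feed a shared classifier that is trained on all modalities at once.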

URL

https://arxiv.org/abs/1806.07753

PDF

https://arxiv.org/pdf/1806.07753.pdf