Joint Learning of Discriminative Low-dimensional Image Representations Based on Dictionary Learning and Two-layer Orthogonal Projections

Abstract

This work investigates the problem of efficiently learning discriminative low-dimensional representations of multi-class, large-scale image objects. We propose a generic deep learning approach that takes advantage of Convolutional Neural Networks (CNNs), sparse dictionary learning, and orthogonal projections. CNNs are not only powerful feature extractors but also robust to spatial variations and changes. Sparse dictionary learning is well known for disentangling the nonlinear underlying discriminative factors in data. Orthogonal projection is a notably efficient tool for projecting multi-class data onto a low-dimensional discriminative subspace. The proposed procedure can be summarized as follows. First, a CNN is employed to extract high-dimensional (HD) preliminary convolutional features. Second, to avoid the high computational cost of sparse coding directly on HD CNN features, we propose to learn sparse representations (SR) in an orthogonally projected space over a task-driven sparsifying dictionary. We then apply a discriminative projection to the SR. The whole learning process is treated as a joint trace quotient maximization problem involving the CNN parameters, the orthogonal projection of the CNN features, the dictionary, and the discriminative projection of the sparse codes. The associated cost function is well defined on the product manifold of the Stiefel manifold, the Oblique manifold, and the Grassmann manifold, and it is optimized via a geometric stochastic gradient descent algorithm. Finally, the quality of the dictionary and projections learned by the proposed approach is evaluated on image classification. Experimental results show that the approach achieves performance strongly competitive with state-of-the-art image classification methods.
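The pipeline in the abstract (orthogonal projection, sparse coding over a unit-norm dictionary, then a discriminative projection scored by a trace quotient) can be illustrated with a small NumPy sketch. It rests on several assumptions not stated in the abstract: random Gaussian vectors stand in for CNN features, the projection `U` and dictionary `D` are fixed random points rather than learned, sparse codes are computed with ISTA (a swapped-in solver), the trace quotient is taken in the common trace-ratio form tr(VᵀS_b V)/tr(VᵀS_w V) over class scatter matrices, and only `V` is updated, using plain Riemannian gradient ascent with a QR retraction rather than the authors' geometric stochastic gradient descent. The mapping of `U`, `D`, `V` onto the Stiefel, Oblique, and Grassmann manifolds is our reading of the abstract, and all names and sizes are hypothetical.

```python
import numpy as np

# Hypothetical sketch: orthogonal projection -> sparse codes over a
# unit-norm dictionary -> trace-quotient ascent on the projection V.
# Not the authors' implementation; the CNN is replaced by random features.

def orthonormal_columns(rows, cols, rng):
    """Random matrix with orthonormal columns (a point on a Stiefel manifold)."""
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q * np.where(np.diag(r) < 0, -1.0, 1.0)  # make the QR factor unique

def unit_norm_columns(rows, cols, rng):
    """Random matrix whose columns have unit norm (a point on the Oblique manifold)."""
    d = rng.standard_normal((rows, cols))
    return d / np.linalg.norm(d, axis=0, keepdims=True)

def ista(D, Z, lam, iters=200):
    """Approximate argmin_A 0.5*||Z - D A||_F^2 + lam*||A||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    A = np.zeros((D.shape[1], Z.shape[1]))
    for _ in range(iters):
        G = A - step * (D.T @ (D @ A - Z))  # gradient step on the quadratic term
        A = np.sign(G) * np.maximum(np.abs(G) - step * lam, 0.0)  # soft-thresholding
    return A

def scatter_matrices(A, labels):
    """Between-class (Sb) and within-class (Sw) scatter of the columns of A."""
    k = A.shape[0]
    mu = A.mean(axis=1, keepdims=True)
    Sb, Sw = np.zeros((k, k)), np.zeros((k, k))
    for c in np.unique(labels):
        Ac = A[:, labels == c]
        mc = Ac.mean(axis=1, keepdims=True)
        Sb += Ac.shape[1] * (mc - mu) @ (mc - mu).T
        Sw += (Ac - mc) @ (Ac - mc).T
    return Sb, Sw

def trace_quotient(V, Sb, Sw):
    """tr(V^T Sb V) / tr(V^T Sw V); invariant to V -> V Q, so it depends only on span(V)."""
    return np.trace(V.T @ Sb @ V) / np.trace(V.T @ Sw @ V)

def riemannian_ascent_step(V, Sb, Sw, lr):
    """One gradient-ascent step on the trace quotient that stays on the Stiefel manifold."""
    tb, tw = np.trace(V.T @ Sb @ V), np.trace(V.T @ Sw @ V)
    egrad = (2.0 * tw * (Sb @ V) - 2.0 * tb * (Sw @ V)) / tw ** 2  # Euclidean gradient
    VtG = V.T @ egrad
    rgrad = egrad - V @ (0.5 * (VtG + VtG.T))  # project onto the tangent space at V
    q, r = np.linalg.qr(V + lr * rgrad)  # QR retraction back onto the manifold
    return q * np.where(np.diag(r) < 0, -1.0, 1.0)

rng = np.random.default_rng(0)
d, m, k, p, n, classes = 512, 64, 128, 9, 300, 10  # hypothetical sizes
labels = rng.integers(0, classes, size=n)
# Stand-in for HD CNN features: class-dependent means plus Gaussian noise.
X = 2.0 * rng.standard_normal((d, classes))[:, labels] + rng.standard_normal((d, n))

U = orthonormal_columns(d, m, rng)   # orthogonal projection of the features
Z = U.T @ X                          # m-dimensional projected features
D = unit_norm_columns(m, k, rng)     # sparsifying dictionary (fixed here)
A = ista(D, Z, lam=0.1)              # sparse codes, one column per image
Sb, Sw = scatter_matrices(A, labels)

V = orthonormal_columns(k, p, rng)   # discriminative projection on the codes
before = trace_quotient(V, Sb, Sw)
for _ in range(100):
    V = riemannian_ascent_step(V, Sb, Sw, lr=0.1)
after = trace_quotient(V, Sb, Sw)
print(f"trace quotient: {before:.3f} -> {after:.3f}")
```

Because the quotient is unchanged when `V` is replaced by `V @ Q` for any orthogonal `Q`, the update can equally be read as a step on the Grassmann manifold. A joint version in the spirit of the abstract would also update `U` with a Stiefel retraction, renormalize the columns of `D` after each step to stay on the Oblique manifold, backpropagate into the CNN, and use mini-batches instead of the full data set.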

Abstract (translated)

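This paper studies the problem of effectively learning low-dimensional discriminative representations of multi-class, large-scale image objects. Drawing on the strengths of convolutional neural networks (CNNs), sparse dictionary learning, and orthogonal projection, a general deep learning method is proposed. CNNs not only have strong feature extraction capability but are also robust to spatial variations and changes. Sparse dictionary learning is known for separating the nonlinear latent discriminative factors in data. Orthogonal projection is an effective tool for projecting multi-class data onto a low-dimensional discriminative subspace. The proposed procedure is summarized as follows. First, a CNN is used to extract high-dimensional (HD) preliminary convolutional features. Second, to avoid the high computational cost of sparse coding directly on HD CNN features, we propose learning the sparse representation (SR) in an orthogonally projected space over a task-driven sparsifying dictionary. A discriminative projection is then applied to the SR. The whole learning process is treated as a joint trace quotient maximization problem involving the CNN parameters, the orthogonal projection of the CNN features, the dictionary, and the discriminative projection of the sparse codes. The associated cost function is well defined on the product manifold of the Stiefel manifold, the Oblique manifold, and the Grassmann manifold, and it is optimized with a geometric stochastic gradient descent algorithm. Finally, the quality of the dictionary and projections learned by the method is further examined from the perspective of image classification. Experimental results show that the method is highly competitive with current state-of-the-art image classification methods.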

URL

https://arxiv.org/abs/1903.09977

PDF

https://arxiv.org/pdf/1903.09977.pdf