Abstract
Learning from different modalities is a challenging task. In this paper, we address the difficult problem of cross-modal face verification and recognition between the caricature and visual image modalities. Caricatures exaggerate a person's facial features, and these extreme variations make building vision models that recognize and verify data from this modality especially hard. Visual images, which contain far fewer distortions, can act as a bridge for analyzing the caricature modality. We introduce a publicly available large Caricature-VIsual dataset [CaVI] with images from both modalities that captures the rich variations in the caricatures of an identity. This paper presents the first cross-modal architecture that handles the extreme distortions of caricatures using a deep learning network that learns similar representations across the modalities. We use two convolutional networks together with transformations subjected to orthogonality constraints to capture shared and modality-specific representations. In contrast to prior research, our approach depends neither on manually extracted facial landmarks for learning the representations nor on the identity of the person for performing verification. The learned shared representation achieves 91% accuracy when verifying unseen images and 75% accuracy on unseen identities. Further, recognizing the identity in an image by knowledge transfer, using a combination of the shared and modality-specific representations, yields an unprecedented performance of 85% rank-1 accuracy for caricatures and 95% rank-1 accuracy for visual images.
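The orthogonality constraint mentioned above, which separates shared from modality-specific representations, is commonly imposed as a soft penalty. A minimal sketch of one such formulation, assuming (as the abstract does not specify) that the constraint is a squared Frobenius-norm penalty on the product of the shared and modality-specific feature matrices, driven toward zero during training:

```python
import numpy as np

def orthogonality_loss(shared, private):
    """Squared Frobenius norm of shared^T @ private.

    shared, private: (batch, dim) feature matrices produced by the
    two branches of the network (names here are illustrative, not
    from the paper). The penalty is zero exactly when every shared
    feature direction is orthogonal to every private one.
    """
    return float(np.sum((shared.T @ private) ** 2))

# Toy example: random features give a positive penalty ...
rng = np.random.default_rng(0)
shared = rng.standard_normal((8, 16))
private = rng.standard_normal((8, 16))
loss = orthogonality_loss(shared, private)

# ... while exactly orthogonal features give zero.
ortho = orthogonality_loss(np.array([[1.0], [0.0]]),
                           np.array([[0.0], [1.0]]))
```

In a full training setup this penalty would be added, with a weighting coefficient, to the verification/recognition objective so that the shared subspace does not absorb modality-specific information.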
URL
https://arxiv.org/abs/1807.11688