Interpretable and Generalizable Deep Image Matching with Adaptive Convolutions


Abstract

For image matching tasks such as face recognition and person re-identification, existing deep networks often focus on representation learning. However, without domain adaptation or transfer learning, the learned model is fixed as is and cannot adapt to various unseen scenarios. In this paper, beyond representation learning, we consider how to formulate image matching directly in deep feature maps. We treat image matching as finding local correspondences in feature maps, and construct adaptive convolution kernels on the fly to achieve local matching. In this way, the matching process and its results are interpretable, and this explicit matching generalizes better to unseen scenarios, such as unknown misalignment and pose or viewpoint changes, than representation features do. To facilitate end-to-end training of such an image matching architecture, we further build a class memory module that caches feature maps of the most recent samples of each class, so that image matching losses can be computed for metric learning. The proposed method is preliminarily validated on the person re-identification task. In direct cross-dataset evaluation, without further transfer learning, it achieves better results than many transfer learning methods. In addition, a model-free, temporal co-occurrence based score weighting method is proposed, which further improves performance and leads to state-of-the-art results in cross-dataset evaluation.
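The abstract describes the matching mechanism only at a high level. Below is a minimal, hypothetical sketch of the adaptive-convolution idea (not the authors' released code): kernels are cut on the fly from the query's feature map and convolved over the gallery's feature map, so each query location searches for its best local correspondence. The use of 1x1 kernels, L2 normalization, and mean-of-maxima pooling are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def adaptive_conv_match(query_fm: torch.Tensor, gallery_fm: torch.Tensor) -> torch.Tensor:
    """Match two images via their feature maps of shape [C, H, W]."""
    c, h, w = query_fm.shape
    # L2-normalize channels so that convolution computes cosine similarity.
    q = F.normalize(query_fm, p=2, dim=0)
    g = F.normalize(gallery_fm, p=2, dim=0)
    # Every spatial location of the query becomes one 1x1 convolution kernel,
    # constructed on the fly from the query itself.
    kernels = q.permute(1, 2, 0).reshape(h * w, c, 1, 1)      # [HW, C, 1, 1]
    responses = F.conv2d(g.unsqueeze(0), kernels)             # [1, HW, Hg, Wg]
    # For each query location, keep its best local correspondence in the
    # gallery, then aggregate the local matches into one similarity score.
    best = responses.flatten(start_dim=2).max(dim=2).values   # [1, HW]
    return best.mean()

score = adaptive_conv_match(torch.randn(256, 24, 8), torch.randn(256, 24, 8))
```

Because the kernels come from the query image rather than from fixed learned weights, the comparison adapts to each pair of images; this is what makes the matching explicit and interpretable, and why no per-domain fine-tuning is needed at test time. The class memory module mentioned in the abstract could likewise be sketched as a per-class cache of feature maps matched with the same function; the interface below is an assumption, not the paper's API.

```python
class ClassMemory:
    """Cache the most recent feature map of each class for matching losses."""

    def __init__(self, num_classes: int, fm_shape: tuple):
        self.bank = torch.zeros(num_classes, *fm_shape)

    def update(self, fms: torch.Tensor, labels: torch.Tensor) -> None:
        # Overwrite each class slot with its newest sample's feature map.
        self.bank[labels] = fms.detach()

    def scores(self, query_fm: torch.Tensor) -> torch.Tensor:
        # Match the query against every cached class exemplar.
        return torch.stack([adaptive_conv_match(query_fm, fm) for fm in self.bank])
```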

URL

https://arxiv.org/abs/1904.10424

PDF

https://arxiv.org/pdf/1904.10424.pdf

