Learning cross space mapping via DNN using large scale click-through logs

2023-02-26 09:00:35
Wei Yu, Kuiyuan Yang, Yalong Bai, Hongxun Yao, Yong Rui

Abstract

The gap between low-level visual signals and high-level semantics has been progressively bridged by the continuous development of deep neural networks (DNNs). With recent progress in DNNs, almost all image classification tasks have achieved new accuracy records. To extend the ability of DNNs to image retrieval tasks, we propose a unified DNN model for image-query similarity calculation that models the image and the query simultaneously in one network. The unified DNN, named the cross space mapping (CSM) model, contains two parts: a convolutional part and a query-embedding part. The image and the query are mapped to a common vector space via these two parts respectively, and image-query similarity is naturally defined as the inner product of their mappings in that space. To ensure good generalization ability of the DNN, we learn its weights from large-scale click-through logs consisting of 23 million clicked image-query pairs between 1 million images and 11.7 million queries. Both qualitative and quantitative results on an image retrieval evaluation task with 1000 queries demonstrate the superiority of the proposed method.
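
As a rough illustration of the similarity definition in the abstract, the sketch below maps images and a query into a shared vector space and ranks the images by inner product. The random linear maps stand in for the convolutional and query-embedding parts. They, the dimensions, and every name used (`make_linear`, `apply_linear`, `similarity`, `rank_images`) are assumptions made for illustration, not the paper's architecture or training procedure.

```python
import random

DIM = 4  # size of the shared vector space (hypothetical)

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a learned mapping (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]

def apply_linear(weights, vec):
    """Map an input vector into the shared space with a matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def similarity(image_vec, query_vec):
    """Image-query similarity as the inner product of the two mappings."""
    return sum(a * b for a, b in zip(image_vec, query_vec))

def rank_images(query_vec, image_vecs):
    """Return image indices sorted from most to least similar to the query."""
    scores = [similarity(v, query_vec) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)

rng = random.Random(0)
W_img = make_linear(8, DIM, rng)  # placeholder for the convolutional part
W_qry = make_linear(6, DIM, rng)  # placeholder for the query-embedding part

# Toy inputs: 3 "images" with 8 features each and one "query" with 6 features.
images = [[rng.random() for _ in range(8)] for _ in range(3)]
query = [rng.random() for _ in range(6)]

image_vecs = [apply_linear(W_img, x) for x in images]
query_vec = apply_linear(W_qry, query)
ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

In a trained model the two mappings would be learned jointly so that clicked image-query pairs score higher than unclicked ones; here the weights are random, so the ranking is meaningful only as a demonstration of the scoring step.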

URL

https://arxiv.org/abs/2302.13275

PDF

https://arxiv.org/pdf/2302.13275.pdf
