Paper Reading AI Learner

Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching

2024-04-28 08:44:28
Haiwen Diao, Ying Zhang, Shang Gao, Xiang Ruan, Huchuan Lu

Abstract

Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Unlike previous approaches that focus on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we aim to leverage knowledge transfer between peer branches in a boosting manner to obtain a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, in which an anchor branch is first trained to provide insights into the data properties, and a target branch gains more advanced knowledge to develop optimal features and distance metrics. Concretely, the anchor branch initially learns the absolute or relative distance between positive and negative pairs, providing a foundational understanding of the particular network and data distribution. Building upon this knowledge, the target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples. Extensive experiments validate that DBL achieves impressive and consistent improvements when applied to various recent state-of-the-art models in the image-text matching field, and that it outperforms related popular cooperative strategies, e.g., Conventional Distillation, Mutual Learning, and Contrastive Learning. Furthermore, we confirm that DBL can be seamlessly integrated into their training scenarios and achieves superior performance under the same computational costs, demonstrating the flexibility and broad applicability of our proposed method. Our code is publicly available at: this https URL.
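To make the anchor/target idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it covers only the relative-distance (triplet) case, uses a standard bidirectional hinge triplet loss for the anchor branch, and assumes a hypothetical adaptive-margin rule for the target branch (each triplet's margin is the gap the anchor already achieved plus a bonus `delta`, floored at `alpha`). The names `relative_gaps`, `hinge_loss`, `anchor_loss`, `target_loss`, `alpha`, and `delta` are illustrative assumptions; the actual DBL margin formulation is given in the paper and its released code.

```python
from typing import List

Matrix = List[List[float]]


def relative_gaps(sim: Matrix) -> List[float]:
    """Positive-minus-negative similarity gaps for every triplet in a batch.

    sim[i][j] scores image i against caption j; the diagonal holds the
    matched pairs. Both retrieval directions are included in a fixed order,
    so gaps from two branches on the same batch line up index by index.
    """
    n = len(sim)
    gaps: List[float] = []
    for i in range(n):
        for j in range(n):
            if i != j:
                gaps.append(sim[i][i] - sim[i][j])  # image i: caption j is a negative
                gaps.append(sim[i][i] - sim[j][i])  # caption i: image j is a negative
    return gaps


def hinge_loss(gaps: List[float], margins: List[float]) -> float:
    """Sum of max(0, margin - gap): zero once every gap exceeds its margin."""
    return sum(max(0.0, m - g) for g, m in zip(gaps, margins))


def anchor_loss(sim: Matrix, alpha: float = 0.2) -> float:
    """Anchor branch: a conventional fixed-margin triplet loss."""
    gaps = relative_gaps(sim)
    return hinge_loss(gaps, [alpha] * len(gaps))


def target_loss(sim_target: Matrix, sim_anchor: Matrix,
                alpha: float = 0.2, delta: float = 0.1) -> float:
    """Target branch with a HYPOTHETICAL adaptive margin.

    Each triplet's margin is the gap the anchor branch already reached plus
    delta, never less than alpha, so the target is pushed to separate matched
    and unmatched pairs further than the anchor did. The anchor's gaps are
    treated as fixed constants here.
    """
    margins = [max(alpha, g + delta) for g in relative_gaps(sim_anchor)]
    return hinge_loss(relative_gaps(sim_target), margins)


if __name__ == "__main__":
    sim = [[0.8, 0.3], [0.2, 0.7]]
    print(anchor_loss(sim))                  # 0.0: every gap already exceeds alpha
    print(round(target_loss(sim, sim), 6))   # 0.4: merely matching the anchor is not enough
```

The usage at the bottom shows the intended effect under these assumptions: a batch the anchor already separates with a fixed margin incurs zero anchor loss, yet a target branch that only reproduces the anchor's similarities is still penalized, because its margins are set relative to what the anchor achieved.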

URL

https://arxiv.org/abs/2404.18114

PDF

https://arxiv.org/pdf/2404.18114.pdf
