Paper Reading AI Learner

Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering

2024-04-24 05:20:42
Jiawei Yao, Qi Qian, Juhua Hu

Abstract

Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced performance by uncovering complex patterns and relationships within large datasets. A major challenge remains, however: users often do not need all the clusterings that algorithms generate, and identifying the one they need requires a substantial understanding of each clustering result. Traditionally, aligning a user's brief keyword of interest with the corresponding vision components was difficult, but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response, given unlabeled target visual data, we propose Multi-MaP, a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings, with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover, a reference word constraint and a concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-MaP not only adeptly captures a user's interest via a keyword but also facilitates the identification of relevant clusterings. Our extensive experiments show that Multi-MaP consistently outperforms state-of-the-art methods on all benchmark multi-clustering vision tasks. Our code is available at this https URL.
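
To make the proxy-learning idea concrete, below is a minimal, hypothetical sketch built on the public CLIP API (not the authors' released code): a learnable text proxy is pulled toward an image embedding while being constrained to stay close to the embedding of the user's keyword (the reference word constraint), and is then mapped to the nearest of several candidate concept words (in the paper such candidates come from GPT-4; here they are hard-coded for illustration). The keyword "color", the candidate list, and the weight `lambda_ref` are assumptions for this sketch, not values from the paper.

```python
# A minimal sketch of multi-modal proxy learning with CLIP, under the
# assumptions stated above. Requires: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_text(words):
    """L2-normalized CLIP text embeddings for a list of words."""
    with torch.no_grad():
        e = model.encode_text(clip.tokenize(words).to(device)).float()
    return e / e.norm(dim=-1, keepdim=True)

# User's keyword of interest and hypothetical GPT-4-suggested candidate concepts.
ref_emb = embed_text(["color"])                     # reference word constraint target
candidates = ["red", "green", "blue"]
cand_embs = embed_text(candidates)                  # concept-level candidates

def learn_proxy(pil_image, lambda_ref=0.5, steps=100, lr=1e-2):
    """Optimize a text proxy for one image, then snap it to the nearest candidate."""
    with torch.no_grad():
        img = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device)).float()
        img = img / img.norm(dim=-1, keepdim=True)

    proxy = ref_emb.clone().requires_grad_(True)    # initialize from the keyword embedding
    opt = torch.optim.Adam([proxy], lr=lr)
    for _ in range(steps):
        p = proxy / proxy.norm(dim=-1, keepdim=True)
        # Maximize image-proxy agreement while staying near the user's keyword.
        loss = -((p * img).sum() + lambda_ref * (p * ref_emb).sum())
        opt.zero_grad()
        loss.backward()
        opt.step()

    p = (proxy / proxy.norm(dim=-1, keepdim=True)).detach()
    return candidates[int((cand_embs @ p.T).argmax())]
```

Given a PIL image, `learn_proxy(image)` returns the candidate concept whose embedding best matches the learned proxy, i.e. the image's label under the clustering perspective implied by the user's keyword.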


URL

https://arxiv.org/abs/2404.15655

PDF

https://arxiv.org/pdf/2404.15655.pdf

