Learning User Preferences for Image Generation Model

2025-08-11 17:39:42
Wenyi Mo, Ying Ba, Tianyu Zhang, Yalong Bai, Biye Li

Abstract

User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models that introduces a contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate that our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is this https URL.
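
The abstract does not give the exact form of the contrastive preference loss, so the following is only a minimal sketch of one common way such a loss could be written: an InfoNCE-style objective that pulls a user representation toward embeddings of liked images and away from embeddings of disliked ones. The function names, the use of cosine similarity, and the temperature value are all assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def contrastive_preference_loss(user, liked, disliked, temperature=0.1):
    """Hypothetical InfoNCE-style loss (an assumption, not the paper's formula).

    user     -- embedding summarizing one user's history
    liked    -- list of embeddings for images the user liked
    disliked -- list of embeddings for images the user disliked
    The loss shrinks as liked items become more similar to the user
    embedding than disliked items.
    """
    pos = [math.exp(cosine(user, v) / temperature) for v in liked]
    neg = [math.exp(cosine(user, v) / temperature) for v in disliked]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))

# Toy 2-D embeddings: the user vector points along the first axis.
user = [1.0, 0.0]
aligned = contrastive_preference_loss(user, [[0.9, 0.1]], [[-1.0, 0.0]])
swapped = contrastive_preference_loss(user, [[-1.0, 0.0]], [[0.9, 0.1]])
print(aligned < swapped)
```

In this toy case the loss is lower when the liked image lies close to the user vector than when the labels are swapped, which is the separation between "likes" and "dislikes" that the abstract describes. How the user representation is produced (for example, from the learnable preference tokens) is left out here because the abstract does not specify it.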


URL

https://arxiv.org/abs/2508.08220

PDF

https://arxiv.org/pdf/2508.08220.pdf

