Personalized Image Generation with Large Multimodal Models

2025-02-02 06:35:42
Yiyan Xu, Wenjie Wang, Yang Zhang, Biao Tang, Peng Yan, Fuli Feng, Xiangnan He

Abstract

Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.

URL

https://arxiv.org/abs/2410.14170

PDF

https://arxiv.org/pdf/2410.14170.pdf

