PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition

2024-04-20 07:05:45
Xi Fang, Weigang Wang, Xiaoxin Lv, Jun Yan

Abstract

The development of Large Language Models (LLMs) and diffusion models has brought a boom in Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality assessment framework that provides a quantifiable evaluation of the images and videos produced by AIGC technologies. Since the content generated by AIGC methods is driven by crafted prompts, it is intuitive that the prompts can also serve as a foundation for AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. The empirical study is conducted on two datasets, AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), and validates the effectiveness of our proposed method, Prompt Condition Quality Assessment (PCQA). This simple and feasible framework may promote research in the multimodal generation field.
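
The abstract only names the components; the short PyTorch sketch below illustrates one plausible way such a prompt-conditioned QA model could be wired together. It is a minimal sketch under stated assumptions, not the authors' implementation: the two linear projections stand in for the dual-source CLIP text encoders, the averaged MLP ensemble stands in for the feature mixer module, and all class names and dimensions (PCQASketch, FeatureMixer, text_dim=512, vision_dim=768) are hypothetical.

# Minimal illustrative sketch (PyTorch) of a prompt-conditioned quality model.
# Everything here is an assumption for illustration, not the paper's code:
# real prompt features would come from two pretrained CLIP text encoders
# ("dual-source"), and real vision features from an image/video backbone.
import torch
import torch.nn as nn

class FeatureMixer(nn.Module):
    """One mixer: blends prompt and vision features with a small MLP (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim * 2, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, prompt_feat, vision_feat):
        # Concatenate the two modalities and project back to the hidden size.
        return self.mlp(torch.cat([prompt_feat, vision_feat], dim=-1))

class PCQASketch(nn.Module):
    """Hypothetical PCQA-style model: dual text sources + mixer ensemble + regressor."""
    def __init__(self, text_dim=512, vision_dim=768, hidden=512, n_mixers=3):
        super().__init__()
        # Stand-ins for adapting features from two CLIP text encoders.
        self.text_proj_a = nn.Linear(text_dim, hidden)
        self.text_proj_b = nn.Linear(text_dim, hidden)
        self.vision_proj = nn.Linear(vision_dim, hidden)
        # "Ensemble-based" mixing, assumed here to mean averaging several mixers.
        self.mixers = nn.ModuleList([FeatureMixer(hidden) for _ in range(n_mixers)])
        self.head = nn.Linear(hidden, 1)  # regress a scalar quality score

    def forward(self, text_feat_a, text_feat_b, vision_feat):
        prompt = self.text_proj_a(text_feat_a) + self.text_proj_b(text_feat_b)
        vision = self.vision_proj(vision_feat)
        mixed = torch.stack([m(prompt, vision) for m in self.mixers]).mean(dim=0)
        return self.head(mixed).squeeze(-1)

# Usage with random stand-in features (batch of 2):
model = PCQASketch()
scores = model(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 768))
print(scores.shape)  # torch.Size([2])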

URL

https://arxiv.org/abs/2404.13299

PDF

https://arxiv.org/pdf/2404.13299.pdf

