Abstract
The development of Large Language Models (LLMs) and diffusion models has driven a boom in Artificial Intelligence Generated Content (AIGC). An effective quality assessment framework is essential for providing a quantifiable evaluation of the images and videos produced by AIGC technologies. Because content generated by AIGC methods is driven by crafted prompts, it is intuitive that the prompts can also serve as a foundation for AIGC quality assessment. This study proposes an effective AIGC quality assessment (QA) framework. First, we propose a hybrid prompt encoding method based on a dual-source CLIP (Contrastive Language-Image Pre-Training) text encoder to understand and respond to the prompt conditions. Second, we propose an ensemble-based feature mixer module to effectively blend the adapted prompt and vision features. Experiments on two datasets, AIGIQA-20K (AI-Generated Image Quality Assessment database) and T2VQA-DB (Text-to-Video Quality Assessment DataBase), validate the effectiveness of our proposed method, Prompt Condition Quality Assessment (PCQA). Our simple and feasible framework may promote research in the multimodal generation field.
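The two components named above can be illustrated with a toy sketch. This is not the PCQA implementation: the abstract does not specify how the two text-encoder sources are combined or how the mixers are built, so the weighted blend, the linear mixer heads, the score averaging, and every function name below are assumptions chosen only to show the general idea of prompt-conditioned scoring.

```python
import random


def hybrid_prompt_embedding(source_a, source_b, alpha=0.5):
    """Blend prompt embeddings from two text-encoder sources (assumed: weighted sum)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(source_a, source_b)]


def mixer_score(prompt_feat, vision_feat, weights, bias=0.0):
    """One hypothetical mixer head: a linear map over concatenated features."""
    fused = prompt_feat + vision_feat  # list concatenation, not element-wise addition
    return sum(w * x for w, x in zip(weights, fused)) + bias


def ensemble_quality(prompt_feat, vision_feat, heads):
    """Ensemble step (assumed): average the scores of all mixer heads."""
    scores = [mixer_score(prompt_feat, vision_feat, w, b) for w, b in heads]
    return sum(scores) / len(scores)


# Random stand-ins for real encoder outputs; a real system would use CLIP features.
random.seed(0)
dim = 4
text_a = [random.gauss(0, 1) for _ in range(dim)]
text_b = [random.gauss(0, 1) for _ in range(dim)]
vision = [random.gauss(0, 1) for _ in range(dim)]
heads = [([random.gauss(0, 1) for _ in range(2 * dim)], 0.0) for _ in range(3)]

prompt = hybrid_prompt_embedding(text_a, text_b)
quality = ensemble_quality(prompt, vision, heads)
print(f"predicted quality score: {quality:.3f}")
```

In a trained model the head weights would be learned against human quality ratings; here they are random, so the printed score carries no meaning beyond showing the data flow.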
URL
https://arxiv.org/abs/2404.13299