Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

2024-04-30 17:55:27
Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) uses tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check results. In this last step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and an image reconstructed from the caption by a text-to-image model; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
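
The three steps described in the abstract (proposal, verification, captioning) can be sketched as a small Python pipeline. This is only an illustrative sketch, not the authors' implementation: the class name VFCSketch, the callable types, and the toy stand-ins in demo() are all hypothetical, and a real system would plug captioning models, an object detector or VQA model, and an LLM into the corresponding slots.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions for illustration, not the paper's API).
Captioner = Callable[[str], str]                    # image path -> caption proposal
ClaimExtractor = Callable[[List[str]], List[str]]   # proposals -> checkable claims (LLM role)
FactTool = Callable[[str, str], bool]               # (image path, claim) -> supported? (detector/VQA role)
Summarizer = Callable[[List[str], Dict[str, bool], str], str]  # final caption writer (LLM role)


@dataclass
class VFCSketch:
    captioners: List[Captioner]
    extract_claims: ClaimExtractor
    check_claim: FactTool
    summarize: Summarizer

    def run(self, image: str, instruction: str) -> str:
        # Step 1, proposal: several captioners each propose a caption.
        proposals = [captioner(image) for captioner in self.captioners]
        # Step 2, verification: turn proposals into claims and check each one.
        claims = self.extract_claims(proposals)
        verdicts = {claim: self.check_claim(image, claim) for claim in claims}
        # Step 3, captioning: summarize proposals and verdicts, following the instruction.
        return self.summarize(proposals, verdicts, instruction)


def demo() -> str:
    # Toy stand-ins so the sketch runs without any model.
    captioners = [
        lambda image: "a dog on a sofa",
        lambda image: "a dog and a cat on a sofa",  # "cat" plays the hallucination
    ]
    vocabulary = ["dog", "cat", "sofa"]
    detected_objects = {"dog", "sofa"}  # pretend detector output for this image

    def extract_claims(proposals: List[str]) -> List[str]:
        # Keep every vocabulary word mentioned by at least one proposal.
        return [w for w in vocabulary if any(w in p.split() for p in proposals)]

    def check_claim(image: str, claim: str) -> bool:
        return claim in detected_objects

    def summarize(proposals: List[str], verdicts: Dict[str, bool], instruction: str) -> str:
        # Keep only the claims that passed verification.
        kept = [claim for claim, ok in verdicts.items() if ok]
        return f"{instruction}: " + ", ".join(kept)

    pipeline = VFCSketch(captioners, extract_claims, check_claim, summarize)
    return pipeline.run("example.jpg", "Describe briefly")


if __name__ == "__main__":
    print(demo())
```

In this toy run the second proposal mentions a cat that the pretend detector does not find, so the final caption keeps only the verified objects. Passing each stage in as a plain callable mirrors the training-free idea in the abstract: components can be swapped without retraining anything.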

URL

https://arxiv.org/abs/2404.19752

PDF

https://arxiv.org/pdf/2404.19752.pdf

