
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

2024-10-29 18:44:59
Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang

Abstract

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at this https URL.
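The round-trip evaluation described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual code: the names `pixel_similarity`, `round_trip_score`, and `toy_render` are hypothetical, the "renderer" is a toy stand-in for a real LaTeX or HTML renderer, pixel similarity is taken here to be the fraction of matching pixels, and scoring a failed render as 0.0 is an assumption made for illustration.

```python
from typing import Callable, List

Image = List[List[int]]  # grayscale image as rows of 0-255 intensities


def pixel_similarity(a: Image, b: Image, tolerance: int = 0) -> float:
    """Fraction of pixel positions whose intensities differ by at most `tolerance`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("images must be non-empty")
    matches = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) <= tolerance
    )
    return matches / total


def round_trip_score(
    input_image: Image,
    predicted_structure: str,
    render: Callable[[str], Image],
) -> float:
    """Render the predicted structure and compare the result with the input image.

    Assumption: a structure that fails to render, or renders to a different
    size, scores 0.0. The benchmark may handle these cases differently.
    """
    try:
        output_image = render(predicted_structure)
        return pixel_similarity(input_image, output_image)
    except ValueError:
        return 0.0


def toy_render(structure: str) -> Image:
    """Hypothetical renderer: '#' becomes a black pixel, '.' a white pixel."""
    palette = {"#": 0, ".": 255}
    rows = []
    for line in structure.splitlines():
        if any(ch not in palette for ch in line):
            raise ValueError(f"cannot render line: {line!r}")
        rows.append([palette[ch] for ch in line])
    return rows


if __name__ == "__main__":
    target = toy_render("##\n..")  # stands in for the input screenshot
    print(round_trip_score(target, "##\n..", toy_render))  # exact reconstruction
    print(round_trip_score(target, "#.\n..", toy_render))  # one of four pixels wrong
```

Because the score depends only on the rendered output, two different structures that render to the same image receive the same score, which is what lets this style of evaluation handle tasks with multiple valid structures.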

URL

https://arxiv.org/abs/2410.22456

PDF

https://arxiv.org/pdf/2410.22456.pdf
