Abstract
We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate 14 prominent VLMs on Image2Struct and find that scores vary widely, indicating that Image2Struct can differentiate between the performance of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at this https URL.
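The round-trip evaluation described above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the grayscale nested-list image type, the `toy_render` renderer, the tolerance-based definition of pixel similarity, and the choice to score an unrenderable prediction as zero are all assumptions made for illustration.

```python
from typing import Callable, List

# Assumption for this sketch: an image is a row-major grid of grayscale values in 0..255.
Image = List[List[int]]


def pixel_similarity(a: Image, b: Image, tolerance: int = 0) -> float:
    """Fraction of pixel positions whose values differ by at most `tolerance`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("images must be non-empty")
    matches = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) <= tolerance
    )
    return matches / total


def round_trip_score(
    input_image: Image,
    predicted_structure: str,
    render: Callable[[str], Image],
    metric: Callable[[Image, Image], float] = pixel_similarity,
) -> float:
    """Render the predicted structure and compare the result with the input image."""
    try:
        output_image = render(predicted_structure)
    except Exception:
        # Assumption: a prediction that cannot be rendered receives the lowest score.
        return 0.0
    return metric(input_image, output_image)


def toy_render(structure: str) -> Image:
    """Hypothetical renderer: each text line is a pixel row; '#' is black, anything else white."""
    return [[0 if ch == "#" else 255 for ch in line] for line in structure.splitlines()]


reference = toy_render("##..\n#...")
same_markup = round_trip_score(reference, "##..\n#...", toy_render)
other_markup = round_trip_score(reference, "##oo\n#ooo", toy_render)
one_pixel_off = round_trip_score(reference, "#...\n#...", toy_render)
print(same_markup, other_markup, one_pixel_off)  # 1.0 1.0 0.875
```

In this toy setting, `"##..\n#..."` and `"##oo\n#ooo"` are different structures that render to the same image, so both receive the same score. This is the property that makes comparing rendered images, rather than the generated text, suitable for tasks with multiple valid structures.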
URL
https://arxiv.org/abs/2410.22456