Abstract
A well-executed graphic design typically achieves harmony at two levels, from fine-grained design elements (color, font, and layout) to the overall design. This complexity makes comprehending graphic design challenging, as it requires the ability to both recognize the design elements and understand the design as a whole. With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe, a benchmark to investigate the capability of MLLMs in design. Our benchmark includes eight tasks in total, spanning both the fine-grained element level and the overall design level. At the design element level, we consider both attribute recognition and semantic understanding tasks. At the overall design level, we include style and metaphor. We test nine MLLMs and employ GPT-4 as the evaluator. Further experiments indicate that refining prompts can enhance the performance of MLLMs. We first rewrite the prompts with different LLMs and find that performance increases when a model's prompts are self-refined by its own LLM. We then add extra task knowledge in two different ways (text descriptions and image examples), finding that adding image examples boosts performance much more than text descriptions.
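The abstract mentions using GPT-4 as the evaluator, a common LLM-as-judge setup. Below is a minimal sketch of how such an evaluation step might be structured; the prompt wording, the 1-10 scale, and the helper names (`build_judge_messages`, `parse_score`) are illustrative assumptions, not the paper's actual protocol.

```python
"""Sketch of an LLM-as-judge scoring step (assumed setup, not the
paper's exact protocol): build a judging prompt, then parse the score
from the judge model's reply."""
import re


def build_judge_messages(question: str, reference: str, answer: str) -> list[dict]:
    """Construct chat messages asking a judge model to score an answer 1-10."""
    system = (
        "You are an impartial judge. Compare the candidate answer with the "
        "reference answer and rate its quality on a scale of 1 to 10. "
        "Reply in the form 'Score: <n>'."
    )
    user = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]


def parse_score(judge_reply: str):
    """Extract the integer score from the judge's reply, or None if absent."""
    m = re.search(r"Score:\s*(\d+)", judge_reply)
    return int(m.group(1)) if m else None


# The messages would then be sent to the judge model through an API client,
# e.g. client.chat.completions.create(model="gpt-4", messages=msgs),
# and parse_score applied to the returned text.
```

Keeping the prompt construction and score parsing as pure functions makes the evaluation pipeline testable without network calls.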
Abstract (translated)
A well-executed graphic design typically achieves harmony at two levels, from fine-grained design elements (color, font, and layout) to the overall design. This complexity makes understanding graphic design challenging, because it requires the ability to both recognize design elements and understand the design. With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe as a benchmark for studying the design capabilities of MLLMs. Our benchmark includes eight tasks, spanning the fine-grained design element level and the overall design level. At the design element level, we consider attribute recognition and semantic understanding tasks. At the overall design level, we include style and metaphor. We test nine MLLMs and use GPT-4 as the evaluator. Furthermore, additional experiments show that refining prompts can improve MLLM performance. We first rewrite the prompts with different LLMs and find that performance clearly improves when prompts are self-refined by a model's own LLM. We then add extra task knowledge in two ways (text descriptions and image examples), finding that adding images yields much greater performance gains than texts.
URL
https://arxiv.org/abs/2404.14801