Abstract
App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration to design and improve their own apps. In recent years, research suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g. whose UI pages require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots selected and often captioned (i.e. labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This finally results in a large dataset which we share with this paper: including 303k app screenshots, out of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind in GUI retrieval. We evaluated our approach on various datasets from related work and in manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip for other GUI tasks including GUI classification and Sketch-to-GUI retrieval with encouraging results.
Abstract (translated)
翻译: 应用程序开发者会从其他应用程序的图形用户界面(GUI)中获得灵感来设计和改进他们的应用程序。近年来,研究建议从自动抓取通过 GUI 探索获得的屏幕截图数据集中检索 GUI 设计的各种方法。然而,这样的文本到 GUI 检索方法仅利用了屏幕截图中 GUI 元素的文本信息,而忽视了视觉信息,如图标或背景图像。此外,检索到的屏幕截图通常不是由应用程序开发者引导的,并且通常缺乏重要的应用程序功能,例如需要用户身份验证的 UI 页面。为了克服这些限制,本文提出了基于 UIClip 视觉语言模型的 GUI 搜索引擎,该模型专门为应用程序 GUI 领域进行训练。为此,我们首先从 Google Play 收集了应用程序介绍图片,这些图片通常显示了由应用开发商选择的最具有代表性的屏幕截图并附有标签(即标注)。然后,我们开发了一个自动化的管道来对这些图像进行分类、裁剪和提取标签。最终,我们得到了一个大型数据集,我们将其与本文分享:包括 303k 个应用程序屏幕截图,其中 135k 个带有标签。我们使用这个数据集来训练了一种新颖的视觉语言模型,据我们所知,这是 GUI 检索领域第一个这样的模型。我们在相关工作和手动实验的各种数据集上评估了我们的方法,结果表明,我们的模型在文本到 GUI 检索方面优于先前的方法,达到召回率@10 最高可达 0.69 和精确率@10 最高可达 0.91。我们还研究了 UIClip 在其他 GUI 任务上的性能,包括 GUI 分类和 Sketch-to-GUI 检索,具有鼓舞人心的结果。
URL
https://arxiv.org/abs/2405.00145