GUing: A Mobile GUI Search Engine using a Vision-Language Model

2024-04-30 18:42:18
Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray, Walid Maalej

Abstract

App developers use the Graphical User Interfaces (GUIs) of other apps as an important source of inspiration when designing and improving their own apps. In recent years, research has suggested various approaches for retrieving GUI designs that fit a given text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g., those whose UI pages require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots, selected and often captioned (i.e., labeled) by app vendors. Then, we developed an automated pipeline to classify and crop these images and extract their captions. This results in a large dataset, which we share with this paper, comprising 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind in GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip on other GUI tasks, including GUI classification and sketch-to-GUI retrieval, with encouraging results.
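
The abstract reports Recall@10 and HIT@10 without defining them. As a rough illustration only, the sketch below uses the common information-retrieval definitions, which are an assumption here and may differ from the paper's exact protocol: Recall@k is the share of a query's relevant screenshots found in the top k results, and HIT@k is 1 if at least one relevant screenshot appears in the top k. The function names, screenshot identifiers, and sample data are hypothetical.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant items that appear among the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears among the top-k results, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def mean_metric(
    metric: Callable[[List[str], Set[str], int], float],
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over all queries."""
    scores = [metric(rankings[q], ground_truth[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical data: ranked screenshot ids per text query, plus the
# ids judged relevant for each query.
rankings = {
    "login screen": ["a", "b", "c"],
    "map view": ["x", "y", "z"],
}
ground_truth = {
    "login screen": {"a", "q"},
    "map view": {"w"},
}

mean_recall = mean_metric(recall_at_k, rankings, ground_truth, k=2)
mean_hit = mean_metric(hit_at_k, rankings, ground_truth, k=2)
```

Under these assumed definitions, HIT@k can never be lower than Recall@k for a query, which is consistent with the reported HIT@10 (0.91) exceeding the reported Recall@10 (0.69).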

URL

https://arxiv.org/abs/2405.00145

PDF

https://arxiv.org/pdf/2405.00145.pdf

