We are witnessing a revolution in conditional image synthesis with the recent success of large-scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio, since sound and sight are two main components of human perception. Hence, we propose a method to enable audio conditioning in large-scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross-attention layers, which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio-conditioned image generation, our method can also be used in conjunction with diffusion-based editing methods to enable audio-conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.
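As a rough illustration of the mechanism described above, the following sketch (our own, not the authors' code; all module names and dimensions are assumptions) shows how audio-encoder features could be mapped to tokens and attended to by an extra cross-attention layer while the pretrained diffusion weights stay frozen:

```python
# Hypothetical sketch: injecting audio tokens into a frozen diffusion backbone
# via an extra cross-attention layer (names and dimensions are illustrative).
import torch
import torch.nn as nn

class AudioTokenizer(nn.Module):
    """Maps pooled audio-encoder features to a sequence of token embeddings."""
    def __init__(self, audio_dim=768, token_dim=1024, n_tokens=8):
        super().__init__()
        self.proj = nn.Linear(audio_dim, n_tokens * token_dim)
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, audio_feat):                      # (B, audio_dim)
        tokens = self.proj(audio_feat)                   # (B, n_tokens * token_dim)
        return tokens.view(-1, self.n_tokens, self.token_dim)

class AudioCrossAttention(nn.Module):
    """Extra attention block: image latents attend to audio tokens."""
    def __init__(self, latent_dim=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, audio_tokens):            # latents: (B, L, latent_dim)
        out, _ = self.attn(self.norm(latents), audio_tokens, audio_tokens)
        return latents + out                              # residual, like text cross-attn

# In training, the pretrained UNet/text layers stay frozen; only these modules update.
tokenizer = AudioTokenizer()
xattn = AudioCrossAttention()
audio_tokens = tokenizer(torch.randn(2, 768))            # from a pretrained audio encoder
latents = torch.randn(2, 64, 1024)                       # a flattened UNet feature map
print(xattn(latents, audio_tokens).shape)                 # torch.Size([2, 64, 1024])
```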
https://arxiv.org/abs/2405.00878
App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration to design and improve their own apps. In recent years, research has suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g. UI pages that require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots, selected and often captioned (i.e. labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This results in a large dataset, which we share with this paper, including 303k app screenshots, of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip for other GUI tasks, including GUI classification and Sketch-to-GUI retrieval, with encouraging results.
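For clarity, here is a small sketch of the retrieval protocol behind the reported numbers; the embeddings are random stand-ins for UIClip text/screenshot features, and the metric definitions follow the usual Recall@k / HIT@k conventions rather than any released evaluation script:

```python
# Illustrative sketch: rank screenshots by cosine similarity to the query
# embedding and report Recall@k / HIT@k. Encoders are not shown; the
# embeddings below are random placeholders.
import numpy as np

def recall_and_hit_at_k(sim, relevant, k=10):
    """sim: (n_queries, n_gallery) similarity matrix.
    relevant: list of sets of relevant gallery indices per query."""
    recalls, hits = [], []
    for q, rel in enumerate(relevant):
        topk = np.argsort(-sim[q])[:k]
        found = len(rel.intersection(topk.tolist()))
        recalls.append(found / max(len(rel), 1))   # Recall@k
        hits.append(1.0 if found > 0 else 0.0)     # HIT@k: any relevant item in top-k
    return float(np.mean(recalls)), float(np.mean(hits))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 512));   text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
gui_emb  = rng.normal(size=(100, 512)); gui_emb  /= np.linalg.norm(gui_emb, axis=1, keepdims=True)
sim = text_emb @ gui_emb.T
print(recall_and_hit_at_k(sim, [{i} for i in range(5)]))
```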
https://arxiv.org/abs/2405.00145
Imbuing machines with the ability to talk has been a longtime pursuit of artificial intelligence (AI) research. From the very beginning, the community has not only aimed to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this when it comes to single, isolated utterances. This unveils an abundance of potential avenues to explore when it comes to combining these single utterances with the aim of synthesising more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology and highlight ways to mitigate those risks and ensure the alignment of ESS capabilities with ethical norms.
https://arxiv.org/abs/2404.19363
Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.
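As a hedged reading of the adaptive prompting strategy (not the authors' implementation; all layer shapes are assumptions), one could derive category-adaptive prompt tokens and channel scales from a handful of target-category image features plus the label's text feature roughly as follows:

```python
# A rough sketch of building category-adaptive prompt tokens and channel
# scales from a few target-category image features and the label's text
# feature. Architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CategoryAdaptivePrompts(nn.Module):
    def __init__(self, feat_dim=512, n_prompts=4, n_channels=768):
        super().__init__()
        self.to_prompts = nn.Linear(feat_dim, n_prompts * feat_dim)
        self.to_scales = nn.Sequential(nn.Linear(feat_dim, n_channels), nn.Sigmoid())
        self.n_prompts, self.feat_dim = n_prompts, feat_dim

    def forward(self, support_img_feats, text_feat):
        # support_img_feats: (n_support, feat_dim) from the CLIP image encoder
        # text_feat: (feat_dim,) from the CLIP text encoder for the category label
        category_feat = support_img_feats.mean(dim=0) + text_feat
        prompts = self.to_prompts(category_feat).view(self.n_prompts, self.feat_dim)
        scales = self.to_scales(text_feat)            # per-channel modulation in (0, 1)
        return prompts, scales

module = CategoryAdaptivePrompts()
prompts, scales = module(torch.randn(8, 512), torch.randn(512))
print(prompts.shape, scales.shape)                    # (4, 512) and (768,)
```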
https://arxiv.org/abs/2404.18695
Generating face images with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting their application. In this paper, we present a novel gaze-controllable face generation task. Our approach takes textual descriptions of human gaze and head behavior as input and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and an eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate the face sketch from the text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research.
https://arxiv.org/abs/2404.17486
The rapid evolution of the fashion industry increasingly intersects with technological advancements, particularly through the integration of generative AI. This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models. Utilizing ControlNet and LoRA fine-tuning, our approach generates high-quality images from multimodal inputs such as text and sketches. We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data. Our evaluation, utilizing metrics like FID, CLIP Score, and KID, demonstrates that our model significantly outperforms traditional stable diffusion models. The results not only highlight the effectiveness of our model in generating fashion-appropriate outputs but also underscore the potential of diffusion models in revolutionizing fashion design workflows. This research paves the way for more interactive, personalized, and technologically enriched methodologies in fashion design and representation, bridging the gap between creative vision and practical application.
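A minimal sketch of the kind of pipeline described, written against the Hugging Face diffusers API; the ControlNet checkpoint is a generic scribble model, and the fashion LoRA path and prompt are hypothetical placeholders rather than the paper's released assets:

```python
# Hedged sketch of a sketch-conditioned, LoRA-finetuned latent diffusion
# pipeline using diffusers. Checkpoint names below are generic/illustrative.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)   # sketch conditioning
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipe.load_lora_weights("path/to/fashion_lora")   # hypothetical fashion-domain LoRA
pipe = pipe.to("cuda")

sketch = Image.open("garment_sketch.png").convert("RGB")
result = pipe("a red silk evening dress, studio photo",
              image=sketch, num_inference_steps=30).images[0]
result.save("generated_garment.png")
```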
https://arxiv.org/abs/2404.18591
Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control of body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in the pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance, and it is hard to balance the realism and the faithfulness of their results to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human, with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled information of partial geometry and texture in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator to achieve disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method compared to state-of-the-art methods.
https://arxiv.org/abs/2404.15889
In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the shape and pose of the object to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image, and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch -- to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at this https URL.
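A toy sketch of the partial forward corruption idea, under the assumption of an absorbing [MASK] token and a linear corruption schedule (both our simplifications, not necessarily the paper's choices):

```python
# Sketch of the partial discrete diffusion forward process: only tokens inside
# the inpainting mask are corrupted; the rest of the image stays intact.
import torch

def partial_forward_corrupt(tokens, region_mask, t, T, mask_id):
    """tokens: (B, L) discrete VQ indices; region_mask: (B, L) bool, True inside
    the region to inpaint; t/T sets the corruption probability."""
    corrupt_prob = t / T
    corrupt = (torch.rand_like(tokens, dtype=torch.float) < corrupt_prob) & region_mask
    corrupted = tokens.clone()
    corrupted[corrupt] = mask_id          # absorb corrupted positions into [MASK]
    return corrupted

tokens = torch.randint(0, 1024, (2, 256))
region = torch.zeros(2, 256, dtype=torch.bool); region[:, 100:180] = True
noisy = partial_forward_corrupt(tokens, region, t=50, T=100, mask_id=1024)
print((noisy != tokens).float().mean())   # only positions inside the region change
```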
https://arxiv.org/abs/2404.11949
A positive margin may result in an increased risk of local recurrence after breast-conserving surgery for any malignant tumour. Reducing the number of positive margins requires offering the surgeon real-time intra-operative information on the presence of positive resection margins. This study aims to design an intra-operative tumour margin evaluation scheme using specimen mammography in breast-conserving surgery. A total of 30 cases were evaluated and compared with contours manually determined by experienced physicians and with the pathology report. The proposed method uses image thresholding to extract regions of interest and then applies a deep learning model, SegNet, to segment the tumour tissue. The width of the margin of normal tissue surrounding the tumour is then evaluated. The desired margin size around the tumour was set to 10 mm. The smallest average difference from the manually sketched margin was 6.53 mm +- 5.84 mm. In all cases, the SegNet architecture was used to obtain the tissue specimen boundary and the tumour contour, respectively. The simulation results indicate that this technology is helpful in discriminating positive from negative margins in the intra-operative setting. The proposed scheme is intended as a potential procedure for an intra-operative measurement system. The experimental results reveal that deep learning techniques can produce results that are consistent with pathology reports.
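One plausible way to implement the margin-width measurement (a sketch under our own assumptions about masks and pixel spacing, not the authors' code) is a distance transform from the specimen boundary evaluated over the segmented tumour region:

```python
# Illustrative sketch: given a binary specimen mask and a tumour mask
# (e.g. from SegNet), estimate the narrowest margin of normal tissue
# between the tumour and the specimen boundary.
import numpy as np
from scipy.ndimage import distance_transform_edt

def margin_width_mm(specimen_mask, tumour_mask, mm_per_pixel):
    """Both masks are 2D boolean arrays on the specimen mammogram."""
    # Distance (in pixels) from every in-specimen pixel to the specimen boundary.
    dist_to_boundary = distance_transform_edt(specimen_mask)
    # The minimum distance over tumour pixels is the narrowest margin.
    return float(dist_to_boundary[tumour_mask].min()) * mm_per_pixel

specimen = np.zeros((200, 200), dtype=bool); specimen[20:180, 20:180] = True
tumour = np.zeros_like(specimen); tumour[90:110, 90:110] = True
print(margin_width_mm(specimen, tumour, mm_per_pixel=0.2))  # compare against the 10 mm target
```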
https://arxiv.org/abs/2404.10600
Our goal is to build embodied agents that can learn inductively generalizable spatial concepts in a continual manner, e.g., constructing a tower of a given height. Existing work suffers from certain limitations: (a) (Liang et al., 2023) and their multi-modal extensions rely heavily on prior knowledge and are not grounded in the demonstrations; (b) (Liu et al., 2023) lacks the ability to generalize due to its purely neural approach. A key challenge is to achieve a fine balance between symbolic representations, which have the capability to generalize, and neural representations, which are physically grounded. In response, we propose a neuro-symbolic approach that expresses inductive concepts as symbolic compositions over grounded neural concepts. Our key insight is to decompose the concept learning problem into the following steps: 1) Sketch: get a programmatic representation for the given instruction; 2) Plan: perform model-based RL over the sequence of grounded neural action concepts to learn a grounded plan; 3) Generalize: abstract out a generic (lifted) Python program to facilitate generalizability. Continual learning is achieved by interspersing the learning of grounded neural concepts with higher-level symbolic constructs. Our experiments demonstrate that our approach significantly outperforms existing baselines in terms of its ability to learn novel concepts and generalize inductively.
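To make the "lifted Python program" idea concrete, here is a toy example of the kind of generalizable program the third step could produce for tower building; the action primitives are hypothetical stand-ins for grounded neural skills:

```python
# Toy lifted program for "build a tower of height n". The primitives below are
# placeholders for grounded neural policies, shown here as print stubs.
def move_to(block):
    print(f"moving to {block}")

def pick(block):
    print(f"picking {block}")

def place_on(target):
    print(f"placing on {target}")

def build_tower(blocks, height):
    """Lifted plan: stack `height` blocks, each on top of the previous one."""
    base = blocks[0]
    for block in blocks[1:height]:
        move_to(block)
        pick(block)
        place_on(base)
        base = block       # the new top of the tower
    return base

build_tower(["red", "green", "blue", "yellow"], height=3)
```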
https://arxiv.org/abs/2404.07774
Human visual imagination usually begins with analogies or rough sketches. For example, given an image of a girl playing guitar in front of a building, one may analogously imagine how it would look if Iron Man were playing guitar in front of a pyramid in Egypt. Nonetheless, the visual condition may not be precisely aligned with the imaginary result indicated by the text prompt, and existing layout-controllable text-to-image (T2I) generation models are prone to producing degraded results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify the rough visual conditions to adapt to the text prompt. The key idea of our SmartControl is to relax the visual condition in the areas that conflict with the text prompt. Specifically, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict the local control scales, and a dataset with text prompts and rough visual conditions is constructed for training the CSP. It is worth noting that, even with a limited number (e.g., 1,000~2,000) of training samples, our SmartControl can generalize well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of our SmartControl against state-of-the-art methods. Source code, pre-trained models, and datasets are available at this https URL.
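A rough sketch of how predicted local control scales could relax the visual condition (our reading of the abstract; the predictor architecture and the injection point are assumptions):

```python
# Sketch: the ControlNet-style residual is down-weighted wherever the
# predicted scale is low, relaxing the visual condition in conflict regions.
import torch
import torch.nn as nn

class ControlScalePredictor(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(channels * 2, channels, 3, padding=1),
                                 nn.SiLU(),
                                 nn.Conv2d(channels, 1, 3, padding=1),
                                 nn.Sigmoid())

    def forward(self, unet_feat, control_feat):
        # Predict a per-pixel scale in [0, 1]; low values relax the condition.
        return self.net(torch.cat([unet_feat, control_feat], dim=1))

csp = ControlScalePredictor(channels=64)
unet_feat = torch.randn(1, 64, 32, 32)
control_residual = torch.randn(1, 64, 32, 32)
scale = csp(unet_feat, control_residual)              # (1, 1, 32, 32)
modulated = unet_feat + scale * control_residual      # scaled injection of the condition
print(modulated.shape)
```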
https://arxiv.org/abs/2404.06451
Automatically generating UI code from webpage design visions can significantly alleviate the burden of developers, enabling beginner developers or designers to directly generate Web pages from design diagrams. Prior research has accomplished the objective of generating UI code from rudimentary design visions or sketches by designing deep neural networks. Inspired by the groundbreaking advancements achieved by Multimodal Large Language Models (MLLMs), the automatic generation of UI code from high-fidelity design images is now emerging as a viable possibility. Nevertheless, our investigation reveals that existing MLLMs are hampered by the scarcity of authentic, high-quality, and large-scale datasets, leading to unsatisfactory performance in automated UI code generation. To address this gap, we present a novel dataset, termed VISION2UI, extracted from real-world scenarios and augmented with comprehensive layout information, tailored specifically for finetuning MLLMs for UI code generation. Specifically, this dataset is derived through a series of operations, encompassing the collection, cleaning, and filtering of the open-source Common Crawl dataset. To uphold its quality, a neural scorer trained on labeled samples is utilized to refine the data, retaining higher-quality instances. Ultimately, this process yields a dataset comprising 2,000 parallel samples (with many more coming soon) encompassing design visions and UI code. The dataset is available at this https URL.
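The neural-scorer filtering step could look roughly like the following sketch; the scorer, threshold, and sample format are placeholders, not the paper's actual pipeline:

```python
# Minimal sketch of quality filtering: keep only samples whose score from a
# trained scorer exceeds a threshold. The scorer here is a trivial stand-in.
def filter_samples(samples, scorer, threshold=0.5):
    """samples: iterable of (screenshot, html) pairs; scorer returns a quality score."""
    kept = []
    for screenshot, html in samples:
        if scorer(screenshot, html) >= threshold:
            kept.append((screenshot, html))
    return kept

dummy = [("img_a.png", "<div>hello</div>"), ("img_b.png", "")]
print(filter_samples(dummy, scorer=lambda img, html: 1.0 if html else 0.0))
```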
https://arxiv.org/abs/2404.06369
We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. We address two types of OOD generalization, i.e., i) domain shift, such as natural to sketch images, and ii) zero-shot capability to recognize categories that were not contained in the finetune data. Arguably, the diminished OOD generalization after finetuning stems from the excessively simplified finetuning target, which only provides the class information, such as "a photo of a [CLASS]". This is distinct from the process by which CLIP was pretrained, where there is abundant text supervision with rich semantic information. Therefore, we propose to compensate for the finetune process using auxiliary supervision with rich semantic information, which acts as anchors to preserve the OOD generalization. Specifically, two types of anchors are elaborated in our method: i) the text-compensated anchor, which uses the images from the finetune set but enriches the text supervision with a pretrained captioner; ii) the image-text-pair anchor, which is retrieved, according to the downstream task, from a dataset similar to the pretraining data of CLIP and is associated with the original CLIP-style text carrying rich semantics. These anchors are utilized as auxiliary semantic information to maintain the original feature space of CLIP, thereby preserving its OOD generalization capabilities. Comprehensive experiments demonstrate that our method achieves in-distribution performance akin to conventional finetuning while attaining new state-of-the-art results on domain shift and zero-shot learning benchmarks.
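A minimal sketch of how the anchor supervision could be combined with the ordinary finetuning objective (the loss form and weighting are our assumptions, not the paper's exact formulation):

```python
# Sketch: alongside the usual finetuning loss, keep anchor features aligned with
# the features the frozen, original CLIP produces for the same anchor data.
import torch
import torch.nn.functional as F

def finetune_with_anchors(class_logits, labels, anchor_feat, frozen_anchor_feat, lam=0.5):
    """class_logits/anchor_feat come from the model being finetuned;
    frozen_anchor_feat comes from the frozen pretrained CLIP on anchor data."""
    task_loss = F.cross_entropy(class_logits, labels)
    # Anchor term: cosine alignment with the original CLIP feature space.
    anchor_loss = 1.0 - F.cosine_similarity(anchor_feat, frozen_anchor_feat, dim=-1).mean()
    return task_loss + lam * anchor_loss

logits = torch.randn(4, 10); labels = torch.randint(0, 10, (4,))
loss = finetune_with_anchors(logits, labels, torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```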
https://arxiv.org/abs/2404.06244
As rapid advances in Artificial Intelligence and the rise of some of history's most potent corporations meet the diminished neoliberal state, people are increasingly subject to power exercised by means of automated systems. Machine learning and related computational technologies now underpin vital government services. They connect consumers and producers in new algorithmic markets. They determine how we find out about everything from how to vote to where to get vaccinated, and whose speech is amplified, reduced, or restricted. And a new wave of products based on Large Language Models (LLMs) will further transform our economic and political lives. Automatic Authorities are automated computational systems used to exercise power over us by determining what we may know, what we may have, and what our options will be. In response to their rise, scholars working on the societal impacts of AI and related technologies have advocated shifting attention from how to make AI systems beneficial or fair towards a critical analysis of these new power relations. But power is everywhere, and is not necessarily bad. On what basis should we object to new or intensified power relations, and what can be done to justify them? This paper introduces the philosophical materials with which to formulate these questions, and offers preliminary answers. It starts by pinning down the concept of power, focusing on the ability that some agents have to shape others' lives. It then explores how AI enables and intensifies the exercise of power so understood, and sketches three problems with power and three ways to solve those problems. It emphasises, in particular, that justifying power requires more than satisfying substantive justificatory criteria; standards of proper authority and procedural legitimacy must also be met. We need to know not only what power may be used for, but how it may be used, and by whom.
https://arxiv.org/abs/2404.05990
We present a large language model (LLM) based system to empower quadrupedal robots with problem-solving abilities for long-horizon tasks beyond short-term motions. Long-horizon tasks for quadrupeds are challenging since they require both a high-level understanding of the semantics of the problem for task planning and a broad range of locomotion and manipulation skills to interact with the environment. Our system builds a high-level reasoning layer with large language models, which generates hybrid discrete-continuous plans as robot code from task descriptions. It comprises multiple LLM agents: a semantic planner for sketching a plan, a parameter calculator for predicting arguments in the plan, and a code generator to convert the plan into executable robot code. At the low level, we adopt reinforcement learning to train a set of motion planning and control skills to unleash the flexibility of quadrupeds for rich environment interactions. Our system is tested on long-horizon tasks that are infeasible to complete with one single skill. Simulation and real-world experiments show that it successfully figures out multi-step strategies and demonstrates non-trivial behaviors, including building tools or notifying a human for help.
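A schematic sketch of the three-agent decomposition; `call_llm`, the prompts, and the skill names are hypothetical placeholders for whatever LLM client and low-level skills are actually used:

```python
# Schematic sketch of the planner -> parameter calculator -> code generator chain.
# `call_llm` is a placeholder; plug in any chat-completion client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def plan_task(task_description: str) -> str:
    plan = call_llm(f"Sketch a step-by-step plan for a quadruped robot to: {task_description}")
    params = call_llm(f"For this plan, predict numeric arguments (distances, angles):\n{plan}")
    robot_code = call_llm(
        "Convert the plan and parameters into executable robot code using the "
        f"skills walk_to(x, y), push(obj), grasp(obj):\n{plan}\n{params}")
    return robot_code

# Usage (with a real LLM client supplied):
# code = plan_task("press the elevator button that is out of reach")
```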
https://arxiv.org/abs/2404.05291
Deep quantization methods have shown high efficiency for large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph; 2) to better preserve semantic information in the quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ achieves state-of-the-art performance on weakly-supervised compact coding. Code is available at this https URL.
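As a simplified illustration of quantization on the hypersphere (codebook size, dimensions, and the hard-assignment rule are our assumptions; the tag-embedding supervision is omitted):

```python
# Sketch: embeddings and codewords live on the unit sphere, and assignment
# uses cosine similarity. Dimensions and codebook size are illustrative.
import torch
import torch.nn.functional as F

def hyperspherical_quantize(features, codebook):
    """features: (B, D); codebook: (K, D). Both are L2-normalized onto the sphere."""
    features = F.normalize(features, dim=-1)
    codebook = F.normalize(codebook, dim=-1)
    sims = features @ codebook.t()              # cosine similarity to each codeword
    codes = sims.argmax(dim=-1)                 # hard assignment
    quantized = codebook[codes]                 # reconstructed (quantized) embeddings
    quant_error = (1.0 - F.cosine_similarity(features, quantized, dim=-1)).mean()
    return codes, quantized, quant_error

codes, quantized, err = hyperspherical_quantize(torch.randn(8, 128), torch.randn(256, 128))
print(codes.shape, err.item())
```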
https://arxiv.org/abs/2404.04998
Robots are being designed to help people in an increasing variety of settings--but seemingly little attention has been given so far to the specific needs of women, who represent roughly half of the world's population but are highly underrepresented in robotics. Here we used a speculative prototyping approach to explore this expansive design space: First, we identified some potential challenges of interest, including crimes and illnesses that disproportionately affect women, as well as potential opportunities for designers, which were visualized in five sketches. Then, one of the sketched scenarios was further explored by developing a prototype of a robotic helper drone equipped with computer vision to detect hidden cameras that could be used to spy on women. While object detection introduced some errors, hidden cameras were identified with a reasonable accuracy of 80% (Intersection over Union (IoU) score: 0.40). Our aim is that the identified challenges and opportunities could help spark discussion and inspire designers, toward realizing a safer, more inclusive future through responsible use of technology.
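For reference, the IoU score quoted above can be computed for axis-aligned boxes as in the short sketch below; the example boxes are invented for illustration:

```python
# Intersection over Union for two axis-aligned bounding boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 60, 60), (30, 30, 80, 80)))   # partial overlap -> IoU of about 0.22
```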
https://arxiv.org/abs/2404.04123
Recently, image-to-3D approaches have achieved significant results with a natural image as input. However, it is not always possible to access these enriched color input samples in practical applications, where only sketches are available. Existing sketch-to-3D research suffers from limitations in broad applications due to the challenges of lacking color information and multi-view content. To overcome them, this paper proposes a novel generation paradigm, Sketch3D, to generate realistic 3D assets with shape aligned with the input sketch and color matching the textual description. Concretely, Sketch3D first instantiates the given sketch in a reference image through a shape-preserving generation process. Second, the reference image is leveraged to deduce a coarse 3D Gaussian prior, and multi-view style-consistent guidance images are generated based on the renderings of the 3D Gaussians. Finally, three strategies are designed to optimize the 3D Gaussians, i.e., structural optimization via a distribution transfer mechanism, color optimization with a straightforward MSE loss, and sketch similarity optimization with a CLIP-based geometric similarity loss. Extensive visual comparisons and quantitative analysis illustrate the advantage of our Sketch3D in generating realistic 3D assets while preserving consistency with the input.
https://arxiv.org/abs/2404.01843
We present FashionEngine, an interactive 3D human generation and editing system that allows us to design 3D digital humans in a way that aligns with how humans interact with the world, such as through natural language, visual perception, and hand drawing. FashionEngine automates 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks. 2) Multimodality-UV Space, encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user's multimodal inputs with the implicit UV latent space for controllable 3D human editing. The multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) Multimodality-UV Aligned Sampler, which learns to sample high-quality and diverse 3D humans from the diffusion prior for multimodal user inputs. Extensive experiments validate FashionEngine's state-of-the-art performance on conditional generation/editing tasks. In addition, we present an interactive user interface for FashionEngine that enables both conditional and unconditional generation tasks, and editing tasks including pose/view/shape control, text-, image-, and sketch-driven 3D human editing, and 3D virtual try-on, in a unified framework. Our project page is at: this https URL.
https://arxiv.org/abs/2404.01655
The integration of knowledge extracted from diverse models, whether described by domain experts or generated by machine learning algorithms, has historically been challenged by the absence of a suitable framework for specifying and integrating structures, learning processes, data transformations, and data models or rules. In this work, we extend algebraic specification methods to address these challenges within such a framework. In our work, we tackle the challenging task of developing a comprehensive framework for defining and analyzing deep learning architectures. We believe that previous efforts have fallen short by failing to establish a clear connection between the constraints a model must adhere to and its actual implementation. Our methodology employs graphical structures that resemble Ehresmann's sketches, interpreted within a universe of fuzzy sets. This approach offers a unified theory that elegantly encompasses both deterministic and non-deterministic neural network designs. Furthermore, we highlight how this theory naturally incorporates fundamental concepts from computer science and automata theory. Our extended algebraic specification framework, grounded in graphical structures akin to Ehresmann's sketches, offers a promising solution for integrating knowledge across disparate models and domains. By bridging the gap between domain-specific expertise and machine-generated insights, we pave the way for more comprehensive, collaborative, and effective approaches to knowledge integration and modeling.
https://arxiv.org/abs/2404.01526