In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor them to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that merely factoring in sketch-specific prompts already yields a category-level ZS-SBIR system that surpasses all prior art by a large margin (24.8%) - strong testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch-shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains, in the region of 26.9%, over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
https://arxiv.org/abs/2303.13440
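The patch-shuffling idea lends itself to a compact illustration. Below is a minimal sketch, assuming square inputs, a fixed shuffle grid, and PyTorch tensors; the paper's exact permutation scheme may differ. The key point is that applying the same permutation to a sketch and its paired photo pushes the model toward patch-level structural correspondence rather than global layout matching.

```python
import torch

def patch_shuffle(x, grid=4, perm=None):
    """Shuffle an image's patches with a shared permutation.

    x: (B, C, H, W) batch; H and W must be divisible by `grid`.
    Pass the returned `perm` to the paired modality so both sides
    are shuffled identically.
    """
    B, C, H, W = x.shape
    ph, pw = H // grid, W // grid
    # (B, C, grid, ph, grid, pw) -> (B, grid*grid, C, ph, pw)
    patches = (x.reshape(B, C, grid, ph, grid, pw)
                .permute(0, 2, 4, 1, 3, 5)
                .reshape(B, grid * grid, C, ph, pw))
    if perm is None:
        perm = torch.randperm(grid * grid)
    patches = patches[:, perm]
    # Reassemble the shuffled patch sequence back into an image.
    x = (patches.reshape(B, grid, grid, C, ph, pw)
                .permute(0, 3, 1, 4, 2, 5)
                .reshape(B, C, H, W))
    return x, perm
```

Usage under this assumption: `s, perm = patch_shuffle(sketch)` followed by `p, _ = patch_shuffle(photo, perm=perm)` produces a consistently shuffled sketch-photo pair.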
For an AI solution to evolve from a trained machine learning model into a production-ready AI system, many more things need to be considered than just the performance of the machine learning model. A production-ready AI system needs to be trustworthy, i.e. of high quality. But how can this be determined in practice? For traditional software, ISO25000 and its predecessors have long been used to define and measure quality characteristics. Recently, quality models for AI systems, based on ISO25000, have been introduced. This paper applies one such quality model to a real-life case study: a deep learning platform for monitoring wildflowers. The paper presents three realistic scenarios sketching what it means to respectively use, extend and incrementally improve the deep learning platform for wildflower identification and counting. Next, it is shown how the quality model can be used as a structured dictionary to define quality requirements for data, model and software. Future work remains to extend the quality model with metrics, tools and best practices to aid AI engineering practitioners in implementing trustworthy AI systems.
https://arxiv.org/abs/2303.13151
Design mockups are essential instruments for visualizing and testing design ideas. However, the process of generating mockups can be time-consuming and challenging for designers. In this article, we present and evaluate two different modalities for generating mockup ideas to support designers in their work: (1) a sketch-based approach to generate mockups based on hand-drawn sketches, and (2) a semantic-based approach to generate interfaces based on a set of predefined design elements. To evaluate the effectiveness of these two approaches, we conducted a series of experiments with 13 participants in which we asked them to generate mockups using each modality. Our results show that sketch-based generation was more intuitive and expressive, while semantic-based generative AI obtained better results in terms of quality and fidelity. Both methods can be valuable tools for UI designers looking to increase their creativity and efficiency.
https://arxiv.org/abs/2303.12709
Neural networks have a number of shortcomings. Among the severest is their sensitivity to distribution shifts, which allows models to be easily fooled into wrong predictions by small perturbations to inputs that are often imperceptible to humans and need not carry semantic meaning. Adversarial training poses a partial solution to this issue by training models on worst-case perturbations. Yet recent work has also pointed out that the reasoning in neural networks differs from that of humans: humans identify objects by shape, while neural nets mainly rely on texture cues. For example, a model trained on photographs will likely fail to generalize to datasets containing sketches. Interestingly, it was also shown that adversarial training appears to favorably increase the shift toward shape bias. In this work, we revisit this observation and provide an extensive analysis of this effect across various architectures, the common $\ell_2$- and $\ell_\infty$-training regimes, and Transformer-based models. Further, we offer a possible explanation for this phenomenon from a frequency perspective.
https://arxiv.org/abs/2303.12669
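The frequency perspective can be probed with a simple filtering experiment. Below is a hedged sketch, assuming grayscale NumPy images and an ideal circular filter (the paper's analysis is more involved): comparing a model's accuracy on low-pass versus high-pass versions of its inputs gives a crude read on shape versus texture reliance.

```python
import numpy as np

def low_pass(img, radius=16):
    """Keep only frequencies within `radius` of the spectrum centre.

    img: (H, W) grayscale array. Low frequencies roughly carry global
    shape; high frequencies carry texture and fine detail.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

# The high-pass counterpart is just the residual:
# high = img - low_pass(img, radius)
```

The cutoff `radius` here is an arbitrary assumption; in practice one would sweep it and plot accuracy as a function of retained bandwidth.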
Assisting people in efficiently producing visually plausible 3D characters has always been a fundamental research topic in computer vision and computer graphics. Recent learning-based approaches have achieved unprecedented accuracy and efficiency in 3D real-human digitization. However, none of the prior works focus on modeling 3D biped cartoon characters, which are also in great demand in gaming and filming. In this paper, we introduce 3DBiCar, the first large-scale dataset of 3D biped cartoon characters, and RaBit, the corresponding parametric model. Our dataset contains 1,500 topologically consistent, high-quality 3D textured models manually crafted by professional artists. Built upon this data, RaBit is designed with an SMPL-like linear blend shape model and a StyleGAN-based neural UV-texture generator, simultaneously expressing shape, pose, and texture. To demonstrate the practicality of 3DBiCar and RaBit, various applications are conducted, including single-view reconstruction, sketch-based modeling, and 3D cartoon animation. For the single-view reconstruction setting, we find that a straightforward global mapping from input images to the output UV-based texture maps tends to lose the detailed appearance of some local parts (e.g., nose, ears). Thus, a part-sensitive texture reasoner is adopted to ensure all important local areas are perceived. Experiments further demonstrate the effectiveness of our method both qualitatively and quantitatively. 3DBiCar and RaBit are available at this http URL.
https://arxiv.org/abs/2303.12564
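The "SMPL-like linear blend shape model" component reduces to a standard linear formula, $V = \bar{V} + \sum_i \beta_i B_i$. A minimal sketch, with hypothetical array shapes (the actual basis dimensionality is whatever RaBit uses):

```python
import numpy as np

def blend_shape(v_template, shape_basis, betas):
    """Linear blend shape: V = V_template + sum_i betas[i] * B[i].

    v_template: (N, 3) mean mesh vertices.
    shape_basis: (K, N, 3) shape blendshapes (PCA-like directions).
    betas:       (K,) identity coefficients.
    """
    return v_template + np.einsum("k,knd->nd", betas, shape_basis)
```

Pose would then be applied on top of the shaped template, as in SMPL-family models.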
The vision of AI collaborators has long been a staple of science fiction, where artificial agents understand nuances of collaboration and human communication. They bring advantages to their human collaborators and teams by contributing their special talents. Government advisory groups and leaders in AI have advocated for years that AIs should be human compatible and be capable of effective collaboration. Nonetheless, robust AIs that can collaborate like talented people remain out of reach. This position paper draws on a cognitive analysis of what effective and robust collaboration requires of human and artificial agents. It sketches a history of public and AI visions for artificial collaborators, starting with early visions of intelligence augmentation (IA) and artificial intelligence (AI). It is intended as motivation and context for a second position paper on collaborative AI (Stefik & Price, 2023). The second paper reviews the multi-disciplinary state-of-the-art and proposes a roadmap for bootstrapping collaborative AIs.
https://arxiv.org/abs/2303.12040
Human sketch has already proved its worth in various visual understanding tasks (e.g., retrieval, segmentation, image captioning, etc.). In this paper, we reveal a new trait of sketches - that they are also salient. This is intuitive, as sketching is at its core a natural attentive process. More specifically, we aim to study how sketches can be used as a weak label to detect salient objects present in an image. To this end, we propose a novel method that emphasises how a "salient object" can be explained by hand-drawn sketches. To accomplish this, we introduce a photo-to-sketch generation model that aims to generate sequential sketch coordinates corresponding to a given visual photo through a 2D attention mechanism. Attention maps accumulated across the time steps give rise to salient regions in the process. Extensive quantitative and qualitative experiments prove our hypothesis and delineate how our sketch-based saliency detection model gives competitive performance compared to the state of the art.
https://arxiv.org/abs/2303.11502
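Deriving saliency from accumulated attention is straightforward to sketch. Assuming the photo-to-sketch decoder exposes one 2D attention map per generated coordinate (a simplification of the paper's mechanism), the aggregation could look like:

```python
import torch

def saliency_from_attention(attn_maps):
    """Aggregate per-timestep attention into a single saliency map.

    attn_maps: (T, H, W) attention over photo features, one map per
    generated sketch coordinate. Summing over time and normalising
    to [0, 1] yields a crude saliency estimate.
    """
    s = attn_maps.sum(dim=0)
    s = s - s.min()
    return s / (s.max() + 1e-8)
```

Upsampling `s` to the photo resolution would give a per-pixel saliency prediction to compare against ground-truth masks.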
The recent explosion of high-quality image-to-image methods has prompted interest in applying them to artistic and design tasks. Of particular interest to architects is using these methods to generate design proposals from conceptual sketches - quickly developed, usually hand-drawn sketches that embody a design intent. More specifically, instantiating a sketch into a visual that can be used to elicit client feedback is typically a time-consuming task, and being able to speed up this iteration time is important. While the body of work in generative methods has been impressive, there has been a mismatch between the quality measures used to evaluate the outputs of these systems and the actual expectations of architects. In particular, most recent image-based works place an emphasis on the realism of generated images. While important, realism is only one of several criteria architects look for. In this work, we describe the expectations architects have for design proposals generated from conceptual sketches, and identify corresponding automated metrics from the literature. We then evaluate several image-to-image generative methods that may address these criteria and examine their performance across these metrics. From these results, we identify certain challenges with hand-drawn conceptual sketches and describe possible future avenues of investigation to address them.
https://arxiv.org/abs/2303.11483
Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how well you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, amongst them showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing the state of the art. We put forward generated results in the supplementary for everyone to scrutinise.
https://arxiv.org/abs/2303.11162
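The decoupled training paradigm can be summarised in a few lines. Below is a hedged sketch with placeholder `mapper` and `G` modules and a plain L1 pair loss standing in for the paper's full objective (which adds a retrieval-based fine-grained discriminative loss and sketch augmentation); the latent shape is an assumption borrowed from typical StyleGAN setups.

```python
import torch
import torch.nn.functional as F

def mapper_step(mapper, G, sketch, photo, optimizer):
    """One training step for a sketch -> latent mapper.

    `G` is a StyleGAN generator pre-trained on photos and kept frozen,
    so anything it decodes stays on the photo manifold; only `mapper`
    (sketch encoder -> W latent codes) receives gradients.
    """
    G.requires_grad_(False)
    w = mapper(sketch)              # e.g. (B, num_ws, 512) latent codes
    recon = G(w)                    # photorealistic decode of the sketch
    loss = F.l1_loss(recon, photo)  # pair supervision (simplified)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the decoder is the design choice doing the heavy lifting: photorealism is guaranteed by construction, so training only has to close the abstraction gap.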
Text-to-image diffusion models are gradually introduced into computer graphics, recently enabling the development of Text-to-3D pipelines in an open domain. However, for interactive editing purposes, local manipulations of content through a simplistic textual interface can be arduous. Incorporating user guided sketches with Text-to-image pipelines offers users more intuitive control. Still, as state-of-the-art Text-to-3D pipelines rely on optimizing Neural Radiance Fields (NeRF) through gradients from arbitrary rendering views, conditioning on sketches is not straightforward. In this paper, we present SKED, a technique for editing 3D shapes represented by NeRFs. Our technique utilizes as few as two guiding sketches from different views to alter an existing neural field. The edited region respects the prompt semantics through a pre-trained diffusion model. To ensure the generated output adheres to the provided sketches, we propose novel loss functions to generate the desired edits while preserving the density and radiance of the base instance. We demonstrate the effectiveness of our proposed method through several qualitative and quantitative experiments.
https://arxiv.org/abs/2303.10735
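The idea of preserving the base instance outside the edited region admits a simple masked-penalty sketch. This is an assumption-laden illustration, not the paper's exact losses; the point samples and the binary edit mask are hypothetical inputs.

```python
import torch

def preservation_loss(sigma_edit, rgb_edit, sigma_base, rgb_base, edit_mask):
    """Keep the edited NeRF close to the base field away from the edit.

    sigma_*: (N,) densities and rgb_*: (N, 3) radiance sampled at the
    same N points; edit_mask: (N,) 1.0 inside the sketched edit region.
    """
    keep = 1.0 - edit_mask                     # penalise only outside the edit
    d_sigma = ((sigma_edit - sigma_base) ** 2 * keep).sum() / (keep.sum() + 1e-8)
    d_rgb = (((rgb_edit - rgb_base) ** 2).sum(-1) * keep).sum() / (keep.sum() + 1e-8)
    return d_sigma + d_rgb
```

A term of this shape would be combined with the diffusion-guided edit objective inside the sketched region.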
Recent studies on transfer learning have shown that selectively fine-tuning a subset of layers or customizing different learning rates for each layer can greatly improve robustness to out-of-distribution (OOD) data and retain generalization capability in the pre-trained models. However, most of these methods employ manually crafted heuristics or expensive hyper-parameter searches, which prevent them from scaling up to large datasets and neural networks. To solve this problem, we propose Trainable Projected Gradient Method (TPGM) to automatically learn the constraint imposed for each layer for a fine-grained fine-tuning regularization. This is motivated by formulating fine-tuning as a bi-level constrained optimization problem. Specifically, TPGM maintains a set of projection radii, i.e., distance constraints between the fine-tuned model and the pre-trained model, for each layer, and enforces them through weight projections. To learn the constraints, we propose a bi-level optimization to automatically learn the best set of projection radii in an end-to-end manner. Theoretically, we show that the bi-level optimization formulation is the key to learning different constraints for each layer. Empirically, with little hyper-parameter search cost, TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance. For example, when fine-tuned on DomainNet-Real and ImageNet, compared to vanilla fine-tuning, TPGM shows $22\%$ and $10\%$ relative OOD improvement respectively on their sketch counterparts. Code is available at \url{this https URL}.
https://arxiv.org/abs/2303.10720
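The per-layer weight projection at the heart of TPGM is compact. A minimal sketch, assuming a single $\ell_2$ distance constraint per layer; keeping the projection differentiable in the radius is what lets the outer loop of the bi-level problem learn it end-to-end.

```python
import torch

def project_layer(w_finetuned, w_pretrained, radius):
    """Project fine-tuned weights onto an l2 ball around the pre-trained ones.

    `radius` is a per-layer scalar (the learned distance constraint).
    If the update stays inside the ball, the weights pass through
    unchanged; otherwise they are scaled back onto its surface.
    """
    delta = w_finetuned - w_pretrained
    scale = torch.clamp(radius / (delta.norm() + 1e-12), max=1.0)
    return w_pretrained + scale * delta
```

In the inner loop this projection constrains ordinary fine-tuning steps; in the outer loop the radii themselves are optimised on held-out data.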
Reverse engineering CAD models from raw geometry is a classic but strenuous research problem. Previous learning-based methods rely heavily on labels due to the supervised design patterns or reconstruct CAD shapes that are not easily editable. In this work, we introduce SECAD-Net, an end-to-end neural network aimed at reconstructing compact and easy-to-edit CAD models in a self-supervised manner. Drawing inspiration from the modeling language that is most commonly used in modern CAD software, we propose to learn 2D sketches and 3D extrusion parameters from raw shapes, from which a set of extrusion cylinders can be generated by extruding each sketch from a 2D plane into a 3D body. By incorporating the Boolean operation (i.e., union), these cylinders can be combined to closely approximate the target geometry. We advocate the use of implicit fields for sketch representation, which allows for creating CAD variations by interpolating latent codes in the sketch latent space. Extensive experiments on both ABC and Fusion 360 datasets demonstrate the effectiveness of our method, and show superiority over state-of-the-art alternatives including the closely related method for supervised CAD reconstruction. We further apply our approach to CAD editing and single-view CAD reconstruction. The code is released at this https URL.
https://arxiv.org/abs/2303.10613
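The extrusion-cylinder representation has a well-known implicit form. A sketch using the standard extrusion SDF (Inigo Quilez's formulation) and union-as-minimum; in the paper the 2D profile would be a learned implicit field rather than an analytic one, and the extrusion parameters would be predicted by the network.

```python
import numpy as np

def extrusion_sdf(profile_sdf, points, height):
    """Signed distance of a 2D profile extruded symmetrically along z.

    profile_sdf: callable mapping (N, 2) plane coordinates to signed
    distances (e.g. a small implicit network for the sketch).
    points: (N, 3) query points; height: total extrusion height.
    """
    d_xy = profile_sdf(points[:, :2])            # distance in the sketch plane
    d_z = np.abs(points[:, 2]) - height / 2.0    # distance along the axis
    inside = np.minimum(np.maximum(d_xy, d_z), 0.0)
    outside = np.hypot(np.maximum(d_xy, 0.0), np.maximum(d_z, 0.0))
    return inside + outside

def union(*sdfs):
    """Boolean union of shapes = pointwise minimum of their SDFs."""
    return np.minimum.reduce(sdfs)
```

Each extrusion cylinder contributes one `extrusion_sdf` (in its own local frame), and `union` composes them into the final shape.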
Garment pattern design aims to convert a 3D garment to the corresponding 2D panels and their sewing structure. Existing methods rely either on template fitting with heuristics and prior assumptions, or on model learning with complicated shape parameterization. Importantly, neither approach allows for personalization of the output garment, for which demand is increasing today. To fill this demand, we introduce PersonalTailor: a personalized 2D pattern design method in which the user can input specific constraints or demands (in language or sketch) for personal 2D panel fabrication from 3D point clouds. PersonalTailor first learns multi-modal panel embeddings based on unsupervised cross-modal association and attentive fusion. It then predicts binary panel masks individually using a transformer encoder-decoder framework. Extensive experiments show that our PersonalTailor excels on both personalized and standard pattern fabrication tasks.
https://arxiv.org/abs/2303.09695
In this work, we investigate the problem of sketch-based object localization on natural images, where given a crude hand-drawn sketch of an object, the goal is to localize all the instances of the same object on the target image. This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap existing between the sketches and the natural images. To mitigate these challenges, existing works proposed attention-based frameworks to incorporate query information into the image features. However, in these works, the query features are incorporated after the image features have already been independently learned, leading to inadequate alignment. In contrast, we propose a sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features leading to stronger alignment with the query sketch. Further, at the output of the decoder, the object and the sketch features are refined to bring the representation of relevant objects closer to the sketch query and thereby improve the localization. The proposed model also generalizes to the object categories not seen during training, as the target image features learned by our method are query-aware. Our localization framework can also utilize multiple sketch queries via a trainable novel sketch fusion strategy. The model is evaluated on the images from the public object detection benchmark, namely MS-COCO, using the sketch queries from QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach gives a $6.6\%$ and $8.0\%$ improvement in mAP for seen objects using sketch queries from QuickDraw! and Sketchy datasets, respectively, and a $12.2\%$ improvement in AP@50 for large objects that are `unseen' during training.
https://arxiv.org/abs/2303.08784
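The per-block conditioning idea can be sketched as a transformer block with an extra cross-attention step between self-attention and the MLP. Dimensions, head counts, and the pre-norm layout below are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class SketchGuidedBlock(nn.Module):
    """Self-attention over image tokens, then cross-attention to the sketch.

    A minimal sketch of conditioning the image encoder on the query at
    every block, rather than fusing independently learned features at
    the end.
    """
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, sketch_tokens):
        x = self.norm1(img_tokens)
        img_tokens = img_tokens + self.self_attn(x, x, x)[0]
        x = self.norm2(img_tokens)
        # Image tokens query the sketch: query-conditioned image features.
        img_tokens = img_tokens + self.cross_attn(x, sketch_tokens, sketch_tokens)[0]
        return img_tokens + self.mlp(self.norm3(img_tokens))
```

Stacking such blocks keeps the image representation aligned with the sketch query throughout the encoder, which is the paper's stated motivation.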
Rising concerns about privacy and anonymity preservation of deep learning models have facilitated research in data-free learning (DFL). For the first time, we identify that for data-scarce tasks like Sketch-Based Image Retrieval (SBIR), where the difficulty of acquiring paired photos and hand-drawn sketches limits data-dependent cross-modal learning algorithms, DFL can prove to be a much more practical paradigm. We thus propose Data-Free (DF)-SBIR, where, unlike existing DFL problems, pre-trained, single-modality classification models have to be leveraged to learn a cross-modal metric space for retrieval without access to any training data. The widespread availability of pre-trained classification models, along with the difficulty of acquiring paired photo-sketch datasets for SBIR, justifies the practicality of this setting. We present a methodology for DF-SBIR that can leverage knowledge from models independently trained to perform classification on photos and sketches. We evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks, designing a variety of baselines based on state-of-the-art DFL literature, and observe that our method surpasses all of them by significant margins. Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data. Implementation is available at \url{this https URL}.
https://arxiv.org/abs/2303.07775
Sensing technologies deployed in the workplace can collect detailed data about individual activities and group interactions that are otherwise difficult to capture. A hopeful application of these technologies is that they can help businesses and workers optimize productivity and wellbeing. However, given the inherent and structural power dynamics in the workplace, the prevalent approach of accepting tacit compliance to monitor work activities rather than seeking workers' meaningful consent raises privacy and ethical concerns. This paper unpacks a range of challenges that workers face when consenting to workplace wellbeing technologies. Using a hypothetical case to prompt reflection among six multi-stakeholder focus groups involving 15 participants, we explored participants' expectations and capacity to consent to workplace sensing technologies. We sketched possible interventions that could better support more meaningful consent to workplace wellbeing technologies by drawing on critical computing and feminist scholarship -- which reframes consent from a purely individual choice to a structural condition experienced at the individual level that needs to be freely given, reversible, informed, enthusiastic, and specific (FRIES). The focus groups revealed that workers are vulnerable to meaningless consent -- dynamics that undo the value of data gathered in the name of "wellbeing," as well as an erosion of autonomy in the workplace. To meaningfully consent, participants wanted changes to how the technology works and is being used, as well as to the policies and practices surrounding the technology. Our mapping of what prevents workers from meaningfully consenting to workplace wellbeing technologies (challenges) and what they require to do so (interventions) underscores that the lack of meaningful consent is a structural problem requiring socio-technical solutions.
https://arxiv.org/abs/2303.07242
Recent advances in face manipulation using StyleGAN have produced impressive results. However, StyleGAN is inherently limited to cropped, aligned faces at the fixed image resolution it is pre-trained on. In this paper, we propose a simple and effective solution to this limitation: using dilated convolutions to rescale the receptive fields of shallow layers in StyleGAN, without altering any model parameters. This allows fixed-size small features at shallow layers to be extended into larger ones that can accommodate variable resolutions, making them more robust in characterizing unaligned faces. To enable real face inversion and manipulation, we introduce a corresponding encoder that provides the first-layer feature of the extended StyleGAN in addition to the latent style code. We validate the effectiveness of our method using unaligned face inputs of various resolutions in a diverse set of face manipulation tasks, including facial attribute editing, super-resolution, sketch/mask-to-face translation, and face toonification.
https://arxiv.org/abs/2303.06146
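Rescaling a receptive field without touching parameters comes down to re-instantiating a convolution with a larger dilation but the same weight tensor. A hedged sketch for odd kernel sizes and stride 1 (the paper applies the idea inside StyleGAN's shallow layers; this only shows the mechanical swap):

```python
import torch.nn as nn

def dilate_conv(conv, dilation):
    """Return a copy of `conv` with enlarged dilation but identical weights.

    The kernel is untouched, so no parameters change; only the spacing
    at which it samples the input (its receptive field) grows. Padding
    is adjusted so the spatial size is preserved for odd kernels.
    """
    k = conv.kernel_size[0]
    new = nn.Conv2d(conv.in_channels, conv.out_channels, k,
                    stride=conv.stride,
                    padding=dilation * (k // 2),
                    dilation=dilation,
                    bias=conv.bias is not None)
    new.weight = conv.weight   # share the pre-trained parameters
    new.bias = conv.bias
    return new
```

Because the weights are shared rather than copied, the rescaled layer stays in sync with the original pre-trained model.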
This paper studies a model learning and online planning approach towards building flexible and general robots. Specifically, we investigate how to exploit the locality and sparsity structures in the underlying environmental transition model to improve model generalization, data efficiency, and runtime efficiency. We present a new domain definition language, named PDSketch. It allows users to flexibly define high-level structures in the transition models, such as object and feature dependencies, in a way similar to how programmers use TensorFlow or PyTorch to specify kernel sizes and hidden dimensions of a convolutional neural network. The details of the transition model are then filled in by trainable neural networks. Based on the defined structures and learned parameters, PDSketch automatically generates domain-independent planning heuristics without additional training. The derived heuristics accelerate planning for novel goals at performance time.
https://arxiv.org/abs/2303.05501
Computational simulations are a popular method for testing hypotheses about the emergence of communication. This kind of research is performed in a variety of traditions including language evolution, developmental psychology, cognitive science, machine learning, robotics, etc. The motivations for the models are different, but the operationalizations and methods used are often similar. We identify the assumptions and explanatory targets of several most representative models and summarise the known results. We claim that some of the assumptions -- such as portraying meaning in terms of mapping, focusing on the descriptive function of communication, modelling signals with amodal tokens -- may hinder the success of modelling. Relaxing these assumptions and foregrounding the interactions of embodied and situated agents allows one to systematise the multiplicity of pressures under which symbolic systems evolve. In line with this perspective, we sketch the road towards modelling the emergence of meaningful symbolic communication, where symbols are simultaneously grounded in action and perception and form an abstract system.
https://arxiv.org/abs/2303.04544
We propose InCA, a lightweight method for transfer learning that cross-attends to any activation layer of a pre-trained model. During training, InCA uses a single forward pass to extract multiple activations, which are passed to external cross-attention adapters, trained anew and combined or selected for downstream tasks. We show that, even when selecting a single top-scoring adapter, InCA achieves performance comparable to full fine-tuning, at a cost comparable to fine-tuning just the last layer. For example, with a cross-attention probe 1.3% the size of a pre-trained ViT-L/16 model, we achieve performance within 0.2% of the full fine-tuning paragon at 51% training cost of the baseline, on average across 11 downstream classification tasks. Unlike other forms of efficient adaptation, InCA does not require backpropagating through the pre-trained model, thus leaving its execution unaltered at both training and inference. The versatility of InCA is best illustrated in fine-grained tasks, which may require accessing information absent in the last layer but accessible in intermediate layer activations. Since the backbone is fixed, InCA allows parallel ensembling as well as parallel execution of multiple tasks. InCA achieves state-of-the-art performance in the ImageNet-to-Sketch multi-task benchmark.
https://arxiv.org/abs/2303.04105
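Two pieces of InCA are easy to sketch: harvesting several intermediate activations in a single frozen forward pass, and a cross-attention probe over one of them. Module-name matching, the hook details, and the learned-query pooling below are assumptions, not the paper's implementation; hooked modules are assumed to return a single tensor.

```python
import torch
import torch.nn as nn

def collect_activations(backbone, layer_names, x):
    """Grab several intermediate activations in one frozen forward pass."""
    acts, handles = {}, []
    for name, module in backbone.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(
                lambda m, inp, out, name=name: acts.__setitem__(name, out)))
    with torch.no_grad():          # no backprop through the backbone
        backbone(x)
    for h in handles:
        h.remove()
    return acts

class CrossAttnProbe(nn.Module):
    """Learned queries cross-attend to one activation map, then classify."""
    def __init__(self, dim, num_classes, num_queries=1, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                 # tokens: (B, N, dim),
        q = self.queries.expand(tokens.size(0), -1, -1)   # e.g. flattened ViT activations
        pooled, _ = self.attn(q, tokens, tokens)
        return self.head(pooled.mean(dim=1))
```

Since gradients never enter the backbone, several such probes (for different layers or tasks) can be trained and executed in parallel against the same cached activations, which matches the ensembling and multi-task properties the abstract highlights.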