In this paper, we explore the unique modality of sketch for explainability, emphasising the profound impact of human strokes compared to conventional pixel-oriented studies. Beyond explanations of network behavior, we discern the genuine implications of explainability across diverse downstream sketch-related tasks. We propose a lightweight and portable explainability solution -- a seamless plugin that integrates effortlessly with any pre-trained model, eliminating the need for re-training. Demonstrating its adaptability, we present four applications: highly studied retrieval and generation, and completely novel assisted drawing and sketch adversarial attacks. The centrepiece of our solution is a stroke-level attribution map that takes different forms when linked with downstream tasks. By addressing the inherent non-differentiability of rasterisation, we enable explanations at both coarse stroke level (SLA) and partial stroke level (P-SLA), each with its advantages for specific downstream tasks.
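As a minimal numpy illustration of what a stroke-level attribution map could look like in practice: given any pixel-level saliency map from a pre-trained model and a rasterised footprint mask per stroke, per-stroke scores can be obtained by pooling saliency inside each stroke. All names below are hypothetical; this is not the paper's SLA/P-SLA formulation, which additionally has to address the non-differentiability of rasterisation.

```python
import numpy as np

def stroke_level_attribution(saliency, stroke_masks, eps=1e-8):
    """Pool a pixel-level saliency map (H, W) into one score per stroke.

    stroke_masks: (num_strokes, H, W) binary footprints obtained by rasterising
    each stroke separately (hypothetical helper). Scores are normalised to sum to 1.
    """
    scores = (saliency[None] * stroke_masks).sum(axis=(1, 2))
    scores /= stroke_masks.sum(axis=(1, 2)) + eps      # mean saliency per stroke
    return scores / (scores.sum() + eps)

# Toy example: 3 strokes on a 64x64 canvas with a random saliency map.
rng = np.random.default_rng(0)
saliency = rng.random((64, 64))
stroke_masks = (rng.random((3, 64, 64)) > 0.9).astype(np.float32)
print(stroke_level_attribution(saliency, stroke_masks))
```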
https://arxiv.org/abs/2403.09480
We propose SketchINR, to advance the representation of vector sketches with implicit neural models. A variable length vector sketch is compressed into a latent space of fixed dimension that implicitly encodes the underlying shape as a function of time and strokes. The learned function predicts the $xy$ point coordinates in a sketch at each time and stroke. Despite its simplicity, SketchINR outperforms existing representations at multiple tasks: (i) Encoding an entire sketch dataset into a fixed size latent vector, SketchINR gives $60\times$ and $10\times$ data compression over raster and vector sketches, respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity representation than other learned vector sketch representations, and is uniquely able to scale to complex vector sketches such as FS-COCO. (iii) SketchINR supports parallelisation that can decode/render $\sim$$100\times$ faster than other learned vector representations such as SketchRNN. (iv) SketchINR, for the first time, emulates the human ability to reproduce a sketch with varying abstraction in terms of number and complexity of strokes. As a first look at implicit sketches, SketchINR's compact high-fidelity representation will support future work in modelling long and complex sketches.
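To make the interface concrete, here is a hedged PyTorch sketch of an implicit decoder of the kind the abstract describes: an MLP conditioned on a fixed-size latent code that maps a stroke index and a time value to an $xy$ point. The architecture and sizes are assumptions for illustration, not the released SketchINR model.

```python
import torch
import torch.nn as nn

class ImplicitSketchDecoder(nn.Module):
    """f(z, stroke_id, t) -> (x, y): a hypothetical SketchINR-style decoder."""

    def __init__(self, latent_dim=256, max_strokes=64, hidden=256):
        super().__init__()
        self.stroke_emb = nn.Embedding(max_strokes, 32)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 32 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                    # predicted (x, y)
        )

    def forward(self, z, stroke_id, t):
        # z: (B, latent_dim), stroke_id: (B,) long, t: (B, 1) in [0, 1]
        h = torch.cat([z, self.stroke_emb(stroke_id), t], dim=-1)
        return self.mlp(h)

decoder = ImplicitSketchDecoder()
z = torch.randn(4, 256)                              # one latent per sketch
stroke_id = torch.tensor([0, 0, 1, 2])
t = torch.rand(4, 1)
print(decoder(z, stroke_id, t).shape)                # torch.Size([4, 2])
```

Since every (stroke, time) query is decoded independently, all points of a sketch can be evaluated in parallel, which is the property behind the reported decoding speed-up.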
https://arxiv.org/abs/2403.09344
Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into corresponding HTML code has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.
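For readers wondering how such screenshot-to-HTML supervision is typically organised, the following is a hedged PyTorch sketch of a paired dataset; the directory layout and file naming are hypothetical and do not reflect WebSight's actual release format.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class ScreenshotToHtmlDataset(Dataset):
    """Yields (screenshot image, target HTML string) pairs for VLM fine-tuning."""

    def __init__(self, root):
        # Hypothetical layout: root/0001.png alongside root/0001.html, etc.
        self.pngs = sorted(Path(root).glob("*.png"))

    def __len__(self):
        return len(self.pngs)

    def __getitem__(self, idx):
        png = self.pngs[idx]
        image = Image.open(png).convert("RGB")
        html = png.with_suffix(".html").read_text(encoding="utf-8")
        return image, html
```

Each pair would then be wrapped in the target VLM's prompt template before computing the usual next-token loss on the HTML string.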
https://arxiv.org/abs/2403.09029
Drawing is an art that enables people to express their imagination and emotions. However, individuals usually face challenges in drawing, especially when translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista - a novel system integrating AR and generative AI technologies. ARtVista not only recommends reference images aligned with users' abstract ideas and generates sketches for users to draw but also goes beyond, crafting vibrant paintings in various painting styles. ARtVista also offers users an alternative approach to create striking paintings by simulating the paint-by-number concept on reference images, empowering users to create visually stunning artwork devoid of the necessity for advanced drawing skills. We perform a pilot study and reveal positive feedback on its usability, emphasizing its effectiveness in visualizing user ideas and aiding the painting process to achieve stunning pictures without requiring advanced drawing skills. The source code will be available at this https URL.
https://arxiv.org/abs/2403.08876
In the realm of fashion design, sketches serve as the canvas for expressing an artist's distinctive drawing style and creative vision, capturing intricate details like stroke variations and texture nuances. The advent of sketch-to-image cross-modal translation technology has notably aided designers. However, existing methods often compromise these sketch details during image generation, resulting in images that deviate from the designer's intended concept. This limitation hampers the ability to offer designers a precise preview of the final output. To overcome this challenge, we introduce HAIFIT, a novel approach that transforms sketches into high-fidelity, lifelike clothing images by integrating multi-scale features and capturing extensive feature map dependencies from diverse perspectives. Through extensive qualitative and quantitative evaluations conducted on our self-collected dataset, our method demonstrates superior performance compared to existing methods in generating photorealistic clothing images. Our method excels in preserving the distinctive style and intricate details essential for fashion design applications.
https://arxiv.org/abs/2403.08651
While manga is a popular entertainment form, creating manga is tedious, especially adding screentones to the created sketch, namely manga screening. Unfortunately, there is no existing method tailored for automatic manga screening, probably due to the difficulty of generating high-quality shaded high-frequency screentones. The classic manga screening approaches generally require user input to provide screentone exemplars or a reference manga image. Recent deep learning models enable automatic generation by learning from a large-scale dataset. However, the state-of-the-art models still fail to generate high-quality shaded screentones due to the lack of a tailored model and high-quality manga training data. In this paper, we propose a novel sketch-to-manga framework that first generates a color illustration from the sketch and then generates a screentoned manga based on the intensity guidance. Our method significantly outperforms existing methods in generating high-quality manga with shaded high-frequency screentones.
https://arxiv.org/abs/2403.08266
This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. We importantly democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial-conditioning. To rectify this, we propose an abstraction-aware framework, utilising a sketch adapter, adaptive time-step sampling, and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple, rough sketch akin to what you and I can create suffices! We welcome everyone to examine results presented in the paper and its supplementary. Contributions include democratising sketch control, introducing an abstraction-aware framework, and leveraging discriminative guidance, validated through extensive experiments.
https://arxiv.org/abs/2403.07234
Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composite image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.
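As a naive baseline illustrating the sketch-plus-text query idea (not the paper's compositionality framework), one can encode both modalities with an off-the-shelf CLIP model from `transformers`, fuse them by a normalised sum, and rank gallery photos by cosine similarity. The file names and the fusion rule below are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_image(img):
    inputs = processor(images=img, return_tensors="pt")
    return torch.nn.functional.normalize(model.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def embed_text(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)

# Hypothetical query: a shoe sketch plus the textual attribute "red leather".
sketch = Image.open("query_sketch.png").convert("RGB")
query = torch.nn.functional.normalize(embed_image(sketch) + embed_text("red leather"), dim=-1)

# Hypothetical gallery of photos, ranked by cosine similarity to the fused query.
gallery = [Image.open(p).convert("RGB") for p in ["photo_a.jpg", "photo_b.jpg"]]
gallery_feats = torch.cat([embed_image(g) for g in gallery])
print((query @ gallery_feats.T).squeeze(0).argsort(descending=True))
```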
https://arxiv.org/abs/2403.07222
This paper, for the first time, explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos. This proficiency is underpinned by their robust cross-modal capabilities and shape bias, findings that are substantiated through our pilot studies. In order to harness pre-trained diffusion models effectively, we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. For the former, we identify which layers are most enriched with information and are best suited for the specific retrieval requirements (category-level or fine-grained). Then we employ visual and textual prompts to guide the model's feature extraction process, enabling it to generate more discriminative and contextually relevant cross-modal representations. Extensive experiments on several benchmark datasets validate significant performance improvements.
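The layer-selection step can be pictured with a generic mechanism: intermediate activations of any pre-trained backbone can be collected with forward hooks and then compared across layers for the retrieval setting at hand. The snippet below uses a ResNet stand-in rather than the paper's diffusion U-Net, and the chosen layers are arbitrary; it only illustrates the probing workflow.

```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None).eval()     # stand-in for a diffusion U-Net
features = {}

def save_to(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Register hooks on candidate layers, then probe which one retrieves best.
for name in ["layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(save_to(name))

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))

for name, feat in features.items():
    print(name, tuple(feat.shape))                   # e.g. layer3 (1, 256, 14, 14)
```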
https://arxiv.org/abs/2403.07214
In this paper, we propose a novel abstraction-aware sketch-based image retrieval framework capable of handling sketch abstraction at varied levels. Prior works had mainly focused on tackling sub-factors such as drawing style and order; we instead attempt to model abstraction as a whole, and propose feature-level and retrieval granularity-level designs so that the system builds into its DNA the necessary means to interpret abstraction. On learning abstraction-aware features, we for the first time harness the rich semantic embedding of a pre-trained StyleGAN model, together with a novel abstraction-level mapper that deciphers the level of abstraction and dynamically selects appropriate dimensions in the feature matrix correspondingly, to construct a feature matrix embedding that can be freely traversed to accommodate different levels of abstraction. For granularity-level abstraction understanding, we dictate that the retrieval model should not treat all abstraction levels equally and introduce a differentiable surrogate Acc.@q loss to inject that understanding into the system. Different to the gold-standard triplet loss, our Acc.@q loss uniquely allows a sketch to narrow/broaden its focus in terms of how stringent the evaluation should be -- the more abstract a sketch, the less stringent (higher $q$). Extensive experiments show that our method outperforms existing state-of-the-art methods in standard SBIR tasks along with challenging scenarios like early retrieval, forensic sketch-photo matching, and style-invariant retrieval.
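One plausible way to obtain a differentiable Acc.@q-style surrogate is to replace the hard rank of the true match with a sigmoid-smoothed count of gallery items scoring above it, and to penalise the query only once that soft rank exceeds q, with q growing for more abstract sketches. The snippet below is a hypothetical surrogate in that spirit, not the paper's exact loss.

```python
import torch

def soft_acc_at_q_loss(sim_pos, sim_neg, q, tau=0.05):
    """sim_pos: (B,) similarity of each sketch to its true photo.
    sim_neg: (B, N) similarities to the rest of the gallery.
    q:       (B,) per-sketch tolerance (larger for more abstract sketches)."""
    # Soft count of negatives ranked above the positive (0-based soft rank).
    soft_rank = torch.sigmoid((sim_neg - sim_pos.unsqueeze(1)) / tau).sum(dim=1)
    # No penalty while the positive sits within the top-q, smooth penalty beyond.
    return torch.relu(soft_rank - (q - 1)).mean()

sim_pos = torch.tensor([0.8, 0.4])
sim_neg = torch.rand(2, 100)
q = torch.tensor([1.0, 10.0])        # a clean sketch vs. a highly abstract one
print(soft_acc_at_q_loss(sim_pos, sim_neg, q))
```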
https://arxiv.org/abs/2403.07203
Research on generative models that produce human-aligned / human-preferred outputs has seen significant recent contributions. Between text- and image-generative models, we narrow our focus to text-based generative models, particularly to produce captions for images that align with human preferences. In this research, we explored a potential method to amplify the performance of a deep neural network model to generate captions that are preferred by humans. This was achieved by integrating supervised learning and Reinforcement Learning with Human Feedback (RLHF) using the Flickr8k dataset. Also, a novel loss function capable of optimizing the model based on human feedback is introduced. In this paper, we provide a concise sketch of our approach and results, hoping to contribute to the ongoing advances in the field of human-aligned generative AI models.
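As a hedged illustration of how supervised learning and feedback-driven reinforcement can be mixed in one captioning objective, the loss below combines a cross-entropy term with a REINFORCE-style term weighted by a centred human-preference reward. The weighting, reward handling, and the simplification of scoring the same token sequence in both terms are assumptions, not the paper's actual loss function.

```python
import torch
import torch.nn.functional as F

def caption_loss(logits, token_ids, reward, lam=0.5, pad_id=0):
    """logits: (B, T, V); token_ids: (B, T) caption tokens; reward: (B,) human score.
    For brevity the same tokens feed both terms; in practice the CE term would use
    reference captions and the RL term the model's own sampled captions."""
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), token_ids.reshape(-1),
                         ignore_index=pad_id)
    logp = F.log_softmax(logits, dim=-1).gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    mask = (token_ids != pad_id).float()
    # REINFORCE-style term: push up the likelihood of captions humans preferred.
    rl = -((reward - reward.mean()).unsqueeze(1) * logp * mask).sum() / mask.sum()
    return (1 - lam) * ce + lam * rl

logits = torch.randn(2, 7, 1000, requires_grad=True)
tokens = torch.randint(1, 1000, (2, 7))
print(caption_loss(logits, tokens, reward=torch.tensor([0.9, 0.2])))
```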
https://arxiv.org/abs/2403.06735
A topical challenge for algorithms in general and for automatic image categorization and generation in particular is presented in the form of a drawing for AI to understand. In a second vein, AI is challenged to produce something similar from a verbal description. The aim of the paper is to highlight strengths and deficiencies of current Artificial Intelligence approaches while coarsely sketching a way forward. A general lack of encompassing symbol-embedding and (not only) -grounding in some bodily basis is made responsible for current deficiencies. A concomitant dearth of hierarchical organization of concepts follows suit. As a remedy for these shortcomings, it is proposed to take a wide step back and to newly incorporate aspects of cybernetics and analog control processes. It is claimed that a promising overarching perspective is provided by the Ouroboros Model with a valid and versatile algorithmic backbone for general cognition at all accessible levels of abstraction and capabilities. Reality, rules, truth, and Free Will are all useful abstractions according to the Ouroboros Model. Logic deduction as well as intuitive guesses are claimed as produced on the basis of one compartmentalized memory for schemata and a pattern-matching, i.e., monitoring process termed consumption analysis. The latter directs attention on short (attention proper) and also on long time scales (emotional biases). In this cybernetic approach, discrepancies between expectations and actual activations (e.g., sensory percepts) drive the general process of cognition and at the same time steer the storage of new and adapted memory entries. Dedicated structures in the human brain work in concert according to this scheme.
https://arxiv.org/abs/2403.04292
Layout-aware text-to-image generation is a task to generate multi-object images that reflect layout conditions in addition to text conditions. The current layout-aware text-to-image diffusion models still have several issues, including mismatches between the text and layout conditions and quality degradation of generated images. This paper proposes a novel layout-aware text-to-image diffusion model called NoiseCollage to tackle these issues. During the denoising process, NoiseCollage independently estimates noises for individual objects and then crops and merges them into a single noise. This operation helps avoid condition mismatches; in other words, it can put the right objects in the right places. Qualitative and quantitative evaluations show that NoiseCollage outperforms several state-of-the-art models. These successful results indicate that the crop-and-merge operation of noises is a reasonable strategy to control image generation. We also show that NoiseCollage can be integrated with ControlNet to use edges, sketches, and pose skeletons as additional conditions. Experimental results show that this integration boosts the layout accuracy of ControlNet. The code is available at this https URL.
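A hedged sketch of the crop-and-merge idea: estimate a noise map per object under that object's own condition, keep each estimate only inside its layout mask, and fall back to a globally conditioned estimate elsewhere. The denoiser call, mask handling, and the non-overlapping-mask assumption below are placeholders, not the released NoiseCollage code.

```python
import torch

def collage_noise(denoiser, x_t, t, object_conds, object_masks, global_cond):
    """x_t: (B, C, H, W) noisy latent; object_masks: list of (B, 1, H, W) binary maps
    derived from the layout boxes (assumed non-overlapping here for simplicity);
    denoiser(x, t, cond) -> predicted noise of the same shape as x (placeholder)."""
    merged = torch.zeros_like(x_t)
    covered = torch.zeros_like(object_masks[0])
    for cond, mask in zip(object_conds, object_masks):
        eps_obj = denoiser(x_t, t, cond)        # noise estimated for this object alone
        merged = merged + mask * eps_obj        # "crop": keep it only inside its region
        covered = torch.clamp(covered + mask, max=1.0)
    eps_bg = denoiser(x_t, t, global_cond)      # one estimate for uncovered regions
    return merged + (1.0 - covered) * eps_bg    # "merge" into a single noise map

# Toy run with a stub denoiser so the sketch executes end to end.
stub = lambda x, t, cond: torch.randn_like(x)
x_t = torch.randn(1, 4, 64, 64)
masks = [torch.zeros(1, 1, 64, 64), torch.zeros(1, 1, 64, 64)]
masks[0][..., :32, :32] = 1.0
masks[1][..., 32:, 32:] = 1.0
print(collage_noise(stub, x_t, 10, ["cat", "dog"], masks, "a cat and a dog").shape)
```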
https://arxiv.org/abs/2403.03485
Chinese landscape painting has a unique and artistic style, and its drawing technique is highly abstract in both the use of color and the realistic representation of objects. Previous methods focus on transferring from modern photos to ancient ink paintings. However, little attention has been paid to translating landscape paintings into modern photos. To solve such problems, in this paper, we (1) propose DLP-GAN (Draw Modern Chinese Landscape Photos with Generative Adversarial Network), an unsupervised cross-domain image translation framework with a novel asymmetric cycle mapping, and (2) introduce a generator based on a dense-fusion module to match different translation directions. Moreover, a dual-consistency loss is proposed to balance the realism and abstraction of model painting. In this way, our model can draw landscape photos and sketches in the modern sense. Finally, based on our collection of modern landscape and sketch datasets, we compare the images generated by our model with other benchmarks. Extensive experiments including user studies show that our model outperforms state-of-the-art methods.
https://arxiv.org/abs/2403.03456
With the advancement of language models (LMs), their exposure to private data is increasingly inevitable, and their deployment (especially for smaller ones) on personal devices, such as PCs and smartphones, has become a prevailing trend. In contexts laden with user information, enabling models to both safeguard user privacy and execute commands efficiently emerges as an essential research imperative. In this paper, we propose CoGenesis, a collaborative generation framework integrating large (hosted on cloud infrastructure) and small models (deployed on local devices) to address privacy concerns logically. Initially, we design a pipeline to create personalized writing instruction datasets enriched with extensive context details as the testbed of this research issue. Subsequently, we introduce two variants of CoGenesis based on sketch and logits respectively. Our experimental findings, based on our synthesized dataset and two additional open-source datasets, indicate that: 1) Large-scale models perform well when provided with user context but struggle in the absence of such context. 2) While specialized smaller models fine-tuned on the synthetic dataset show promise, they still lag behind their larger counterparts. 3) Our CoGenesis framework, utilizing mixed-scale models, showcases competitive performance, providing a feasible solution to privacy issues.
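As a rough, hypothetical illustration of the logits-based flavour of such collaboration, a small on-device model that sees the private context could be combined at every decoding step with a cloud model that only sees the public prompt, by mixing their next-token logits so that the private text itself never leaves the device. The mixing rule below is an assumption for illustration only.

```python
import torch

def collaborative_next_token(small_logits, large_logits, alpha=0.5):
    """small_logits: (V,) from the on-device model conditioned on private context.
    large_logits: (V,) from the cloud model conditioned on the public prompt only.
    Only logits cross the device boundary in this sketch, never the private text."""
    probs = torch.softmax(alpha * small_logits + (1 - alpha) * large_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

vocab = 32000
print(collaborative_next_token(torch.randn(vocab), torch.randn(vocab)))
```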
https://arxiv.org/abs/2403.03129
Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Sketches are easy for users to provide on the fly like language, but similar to images they can also help a downstream policy to be spatially-aware and even go beyond images to disambiguate task-relevant from task-irrelevant objects. We present RT-Sketch, a goal-conditioned policy for manipulation that takes a hand-drawn sketch of the desired scene as input, and outputs actions. We train RT-Sketch on a dataset of paired trajectories and corresponding synthetically generated goal sketches. We evaluate this approach on six manipulation skills involving tabletop object rearrangements on an articulated countertop. Experimentally we find that RT-Sketch is able to perform on a similar level to image or language-conditioned agents in straightforward settings, while achieving greater robustness when language goals are ambiguous or visual distractors are present. Additionally, we show that RT-Sketch has the capacity to interpret and act upon sketches with varied levels of specificity, ranging from minimal line drawings to detailed, colored drawings. For supplementary material and videos, please refer to our website: this http URL.
https://arxiv.org/abs/2403.02709
We introduce a novel sketch-to-image tool that aligns with the iterative refinement process of artists. Our tool lets users sketch blocking strokes to coarsely represent the placement and form of objects and detail strokes to refine their shape and silhouettes. We develop a two-pass algorithm for generating high-fidelity images from such sketches at any point in the iterative process. In the first pass we use a ControlNet to generate an image that strictly follows all the strokes (blocking and detail) and in the second pass we add variation by renoising regions surrounding blocking strokes. We also present a dataset generation scheme that, when used to train a ControlNet architecture, allows regions that do not contain strokes to be interpreted as not-yet-specified regions rather than empty space. We show that this partial-sketch-aware ControlNet can generate coherent elements from partial sketches that only contain a small number of strokes. The high-fidelity images produced by our approach serve as scaffolds that can help the user adjust the shape and proportions of objects or add additional elements to the composition. We demonstrate the effectiveness of our approach with a variety of examples and evaluative comparisons.
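The second pass can be pictured as partial renoising restricted to the neighbourhood of blocking strokes: re-inject noise at an intermediate timestep only inside a mask around those strokes, keep detail-stroke regions fixed, and denoise again. The snippet below is a hedged sketch of that masked renoising step with the sampler left as a comment; it is not the authors' implementation.

```python
import torch

def renoise_blocking_regions(x0, mask, alpha_bar_t):
    """x0: (B, C, H, W) first-pass result; mask: (B, 1, H, W), 1 around blocking
    strokes; alpha_bar_t: cumulative noise-schedule value at the chosen timestep."""
    eps = torch.randn_like(x0)
    noisy = (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * eps
    # Re-noise only near blocking strokes; detail-stroke regions stay untouched.
    return mask * noisy + (1 - mask) * x0

x0 = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 20:44, 20:44] = 1.0               # hypothetical blocking-stroke region
x_t = renoise_blocking_regions(x0, mask, alpha_bar_t=0.3)
# x_t would then be passed back through the (ControlNet-conditioned) sampler
# from the corresponding intermediate timestep to add variation in that region.
print(x_t.shape)
```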
https://arxiv.org/abs/2402.18116
Videos are prominent learning materials to prepare surgical trainees before they enter the operating room (OR). In this work, we explore techniques to enrich the video-based surgery learning experience. We propose Surgment, a system that helps expert surgeons create exercises with feedback based on surgery recordings. Surgment is powered by a few-shot-learning-based pipeline (SegGPT+SAM) to segment surgery scenes, achieving an accuracy of 92\%. The segmentation pipeline enables functionalities to create visual questions and feedback desired by surgeons from a formative study. Surgment enables surgeons to 1) retrieve frames of interest through sketches, and 2) design exercises that target specific anatomical components and offer visual feedback. In an evaluation study with 11 surgeons, participants applauded the search-by-sketch approach for identifying frames of interest and found the resulting image-based questions and feedback to be of high educational value.
https://arxiv.org/abs/2402.17903
Reverse engineering in the realm of Computer-Aided Design (CAD) has been a longstanding aspiration, though not yet entirely realized. Its primary aim is to uncover the CAD process behind a physical object given its 3D scan. We propose CAD-SIGNet, an end-to-end trainable and auto-regressive architecture to recover the design history of a CAD model represented as a sequence of sketch-and-extrusion from an input point cloud. Our model learns visual-language representations by layer-wise cross-attention between point cloud and CAD language embedding. In particular, a new Sketch instance Guided Attention (SGA) module is proposed in order to reconstruct the fine-grained details of the sketches. Thanks to its auto-regressive nature, CAD-SIGNet not only reconstructs a unique full design history of the corresponding CAD model given an input point cloud but also provides multiple plausible design choices. This allows for an interactive reverse engineering scenario by providing designers with multiple next-step choices along with the design process. Extensive experiments on publicly available CAD datasets showcase the effectiveness of our approach against existing baseline models in two settings, namely, full design history recovery and conditional auto-completion from point clouds.
https://arxiv.org/abs/2402.17678
Personalization techniques for large text-to-image (T2I) models allow users to incorporate new concepts from reference images. However, existing methods primarily rely on textual descriptions, leading to limited control over customized images and failing to support fine-grained and local editing (e.g., shape, pose, and details). In this paper, we identify sketches as an intuitive and versatile representation that can facilitate such control, e.g., contour lines capturing shape information and flow lines representing texture. This motivates us to explore a novel task of sketch concept extraction: given one or more sketch-image pairs, we aim to extract a special sketch concept that bridges the correspondence between the images and sketches, thus enabling sketch-based image synthesis and editing at a fine-grained level. To accomplish this, we introduce CustomSketching, a two-stage framework for extracting novel sketch concepts. Considering that an object can often be depicted by a contour for general shapes and additional strokes for internal details, we introduce a dual-sketch representation to reduce the inherent ambiguity in sketch depiction. We employ a shape loss and a regularization loss to balance fidelity and editability during optimization. Through extensive experiments, a user study, and several applications, we show our method is effective and superior to the adapted baselines.
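A hedged reading of the fidelity/editability balance described above: a shape (fidelity) term pulling the learned sketch concept toward reconstructing the reference pair, plus a regularisation term keeping the learned embeddings close to their initialisation so the concept remains editable. Both terms below are generic placeholders, not CustomSketching's actual losses.

```python
import torch
import torch.nn.functional as F

def concept_loss(pred_image, target_image, learned_emb, init_emb, lam=0.01):
    """Fidelity term on the reconstruction, regulariser on the concept embedding."""
    shape_loss = F.mse_loss(pred_image, target_image)   # fidelity / shape
    reg_loss = F.mse_loss(learned_emb, init_emb)        # editability
    return shape_loss + lam * reg_loss

pred = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)
learned = torch.randn(8, 768, requires_grad=True)
init = learned.detach() + 0.1 * torch.randn_like(learned)
print(concept_loss(pred, target, learned, init))
```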
https://arxiv.org/abs/2402.17624