Informational videos are a crucial medium for explaining conceptual and procedural knowledge to novices and experts alike. When producing such videos, editors overlay text or images and trim footage to improve quality and engagement. However, video editing can be difficult and time-consuming, especially for novice editors who often struggle to express and implement their editing ideas. To address this challenge, we first explored how multimodality, combining natural language (NL) and sketching, the modalities humans naturally use for expression, can support video editors in expressing editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed patterns in how NL and sketching are used to describe edit intents. Based on these findings, we present ExpressEdit, a system that enables video editing via NL text and sketching on the video frame. Powered by an LLM and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command, as well as spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced novice video editors' ability to express and implement their edit ideas. By generating edits from the user's multimodal commands and supporting iteration on those commands, the system allowed participants to perform edits more efficiently and generate more ideas. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.
https://arxiv.org/abs/2403.17693
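To make the interpretation step above concrete, here is a minimal, hypothetical sketch of splitting a multimodal edit command into temporal, spatial, and operational references. The prompt wording, field names, and the `llm` callable are assumptions for illustration, not ExpressEdit's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
import json

@dataclass
class EditReferences:
    temporal: List[str]      # e.g. "whenever the speaker shows the map"
    spatial: List[str]       # e.g. "in the top-right corner"
    operational: List[str]   # e.g. "overlay a zoom-in effect"

PROMPT = (
    "You are a video-editing assistant. Split the user's edit command into "
    "temporal, spatial, and operational references. Return JSON with keys "
    '"temporal", "spatial", "operational" (lists of strings).\nCommand: {command}'
)

def parse_edit_command(command: str,
                       sketch_bbox: Optional[Tuple[int, int, int, int]],
                       llm: Callable[[str], str]) -> dict:
    """llm is any callable mapping a prompt string to the model's JSON answer."""
    refs = EditReferences(**json.loads(llm(PROMPT.format(command=command))))
    # A rectangle sketched on the frame acts as an additional spatial reference
    # (x, y, width, height in pixels).
    sketched_regions = [sketch_bbox] if sketch_bbox else []
    return {"references": refs, "sketched_regions": sketched_regions}
```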
The drawing order of a sketch records how it is created stroke by stroke by a human. For graphic sketch representation learning, recent studies have injected sketch drawing orders into graph edge construction by linking each patch to another according to a temporal-based nearest-neighboring strategy. However, such constructed graph edges may be unreliable, since the same sketch could be drawn in variant orders. In this paper, we propose a variant-drawing-protected method that equips sketch patches with context-aware positional encoding (PE) to make better use of drawing orders for learning graphic sketch representations. Instead of injecting drawing orders into graph edges, we embed this sequential information into graph nodes only. More specifically, each patch embedding is equipped with a sinusoidal absolute PE to highlight its sequential position in the drawing order. Its neighboring patches, ranked by the self-attention scores between patch embeddings, are equipped with learnable relative PEs to restore the contextual positions within a neighborhood. During message aggregation via graph convolutional networks, a node receives both semantic content from patch embeddings and contextual patterns from its neighbors' PEs, arriving at drawing-order-enhanced sketch representations. Experimental results indicate that our method significantly improves sketch healing and controllable sketch synthesis.
https://arxiv.org/abs/2403.17525
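A minimal PyTorch sketch of the mechanism described above, under illustrative shapes and names (not the authors' code): a sinusoidal absolute PE marks each patch's position in the drawing order, neighbours are ranked by self-attention scores, and learnable relative PEs are added per neighbour rank before a simple aggregation step.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(num_patches, dim):
    # Standard sinusoidal encoding over drawing-order positions (dim assumed even).
    pos = torch.arange(num_patches, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_patches, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class OrderAwareAggregation(nn.Module):
    def __init__(self, dim, k_neighbors=4):
        super().__init__()
        self.k = k_neighbors
        # One learnable relative PE per neighbour rank (1st-nearest, 2nd-nearest, ...).
        self.rel_pe = nn.Parameter(torch.zeros(k_neighbors, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, patches):
        # patches: (N, dim) patch embeddings listed in drawing order.
        n, d = patches.shape
        x = patches + sinusoidal_pe(n, d).to(patches.device)   # absolute PE on nodes
        attn = torch.softmax(x @ x.t() / d ** 0.5, dim=-1)     # self-attention scores
        attn.fill_diagonal_(-1.0)                              # exclude self from neighbours
        idx = attn.topk(self.k, dim=-1).indices                # neighbours ranked by score
        neigh = x[idx] + self.rel_pe                           # (N, k, dim), plus relative PE per rank
        return self.proj(x + neigh.mean(dim=1))                # simple mean aggregation
```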
Recently, a simple but powerful language for expressing and learning general policies and problem decompositions (sketches) has been introduced in terms of rules defined over a set of Boolean and numerical features. In this work, we consider three extensions of this language aimed at making policies and sketches more flexible and reusable: internal memory states, as in finite state controllers; indexical features, whose values are a function of the state and a number of internal registers that can be loaded with objects; and modules that wrap up policies and sketches and allow them to call each other by passing parameters. In addition, unlike general policies that select state transitions rather than ground actions, the new language allows for the selection of such actions. The expressive power of the resulting language for policies and sketches is illustrated through a number of examples.
https://arxiv.org/abs/2403.16824
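As a loose illustration (not the paper's formal syntax), policy rules over Boolean and numerical features, extended with an internal memory state as described above, can be encoded roughly as follows; the feature names, effect labels, and memory states are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

FeatureValues = Dict[str, object]   # e.g. {"holding": True, "dist_to_goal": 3}

@dataclass
class Rule:
    condition: Callable[[FeatureValues, str], bool]   # (features, memory) -> applicable?
    effect: Dict[str, str]                            # e.g. {"dist_to_goal": "DEC"}
    next_memory: str                                  # internal controller state after firing

def applicable_rules(rules, features: FeatureValues, memory: str):
    return [r for r in rules if r.condition(features, memory)]

# Example: "while far from the goal and in memory state q0, decrease the distance",
# then switch memory state once the goal region is reached.
rules = [
    Rule(condition=lambda f, m: m == "q0" and f["dist_to_goal"] > 0,
         effect={"dist_to_goal": "DEC"}, next_memory="q0"),
    Rule(condition=lambda f, m: m == "q0" and f["dist_to_goal"] == 0,
         effect={"holding": "FALSE"}, next_memory="q1"),
]
print(applicable_rules(rules, {"dist_to_goal": 2, "holding": True}, "q0"))
```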
The impressive performance of large language models (LLMs) on code-related tasks has shown the potential of fully automated software development. In light of this, we introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo). This task aims to generate an entire code repository from its natural language requirements. To address this task, we propose a simple yet effective framework CodeS, which decomposes NL2Repo into multiple sub-tasks by a multi-layer sketch. Specifically, CodeS includes three modules: RepoSketcher, FileSketcher, and SketchFiller. RepoSketcher first generates a repository's directory structure for given requirements; FileSketcher then generates a file sketch for each file in the generated structure; SketchFiller finally fills in the details for each function in the generated file sketch. To rigorously assess CodeS on the NL2Repo task, we carry out evaluations through both automated benchmarking and manual feedback analysis. For benchmark-based evaluation, we craft a repository-oriented benchmark, SketchEval, and design an evaluation metric, SketchBLEU. For feedback-based evaluation, we develop a VSCode plugin for CodeS and engage 30 participants in conducting empirical studies. Extensive experiments prove the effectiveness and practicality of CodeS on the NL2Repo task.
https://arxiv.org/abs/2403.16443
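A hedged sketch of the three-stage decomposition described above (RepoSketcher, FileSketcher, SketchFiller). The prompts and the `llm` callable are placeholders, not the CodeS implementation.

```python
def repo_sketcher(requirements, llm):
    """RepoSketcher: propose the repository's directory structure as a list of file paths."""
    return llm(f"List, one per line, the files of a repository satisfying:\n{requirements}").splitlines()

def file_sketcher(requirements, path, llm):
    """FileSketcher: produce a file sketch (imports, signatures, docstrings, no bodies)."""
    return llm(f"Write only imports, signatures, and docstrings for {path}, given:\n{requirements}")

def sketch_filler(file_sketch, llm):
    """SketchFiller: fill in the body of every function in the file sketch."""
    return llm(f"Complete every function in this sketch:\n{file_sketch}")

def nl2repo(requirements, llm):
    # llm is any callable mapping a prompt string to generated text.
    return {path: sketch_filler(file_sketcher(requirements, path, llm), llm)
            for path in repo_sketcher(requirements, llm)}
```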
The challenge in combined task and motion planning (TAMP) is the effective integration of a search over a combinatorial space, usually carried out by a task planner, and a search over a continuous configuration space, carried out by a motion planner. Using motion planners for testing the feasibility of task plans and filling out the details is not effective because it makes the geometrical constraints play a passive role. This work introduces a new interleaved approach for integrating the two dimensions of TAMP that makes use of sketches, a recent simple but powerful language for expressing the decomposition of problems into subproblems. A sketch has width 1 if it decomposes the problem into subproblems that can be solved greedily in linear time. In the paper, a general sketch that has width 1 under suitable assumptions is introduced for several classes of TAMP problems. While sketch decompositions have been developed for classical planning, they offer two important benefits in the context of TAMP. First, when a task plan is found to be infeasible due to geometric constraints, the combinatorial search resumes in a specific subproblem. Second, the sampling of object configurations is not done once, globally, at the start of the search, but locally, at the start of each subproblem. Optimizations of this basic setting are also considered, and experimental results over existing and new pick-and-place benchmarks are reported.
https://arxiv.org/abs/2403.16277
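An illustrative rendering of the interleaving described above, with all planner and sampler objects as placeholders: object configurations are sampled locally per subproblem, and a geometrically infeasible task plan sends the combinatorial search back into that same subproblem.

```python
def solve_tamp(problem, sketch, task_planner, motion_planner, sampler):
    # Placeholder objects stand in for a width-1 sketch decomposition, a combinatorial
    # task planner, a motion planner, and a configuration sampler.
    state = problem.initial_state
    while not problem.is_goal(state):
        subproblem = sketch.next_subgoal(state)           # width-1 subproblem from the sketch
        configs = sampler.sample(subproblem)              # local, per-subproblem sampling
        rejected, plan = set(), None
        while plan is None:
            candidate = task_planner.solve(subproblem, configs, exclude=rejected)
            if candidate is None:                         # no task plan for these samples
                configs = sampler.sample(subproblem)      # resample locally and retry
                rejected.clear()
                continue
            plan = motion_planner.refine(candidate)       # geometric feasibility check
            if plan is None:
                rejected.add(candidate)                   # search resumes in this subproblem
        state = plan.execute(state)
    return state
```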
Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities, modifying the structure of the denoising network to take multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating conditioning details at different levels of granularity. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence with respect to the provided multimodal inputs.
https://arxiv.org/abs/2403.14828
One of the biggest challenges in single-view 3D shape reconstruction in the wild is the scarcity of <3D shape, 2D image>-paired data from real-world environments. Inspired by the remarkable achievements of domain randomization, we propose ObjectDR, which synthesizes such paired data via random simulation of visual variations in object appearances and backgrounds. Our data synthesis framework exploits a conditional generative model (e.g., ControlNet) to generate images conforming to spatial conditions such as 2.5D sketches, which are obtainable through a rendering process applied to 3D shapes from object collections (e.g., Objaverse-XL). To simulate diverse variations while preserving the object silhouettes embedded in the spatial conditions, we also introduce a disentangled framework that leverages an initial object guidance. After synthesizing a wide range of data, we pre-train a model on it so that it learns to capture a domain-invariant geometry prior that is consistent across various domains. We validate its effectiveness by substantially improving 3D shape reconstruction models on a real-world benchmark. In a scale-up evaluation, our pre-training achieves 23.6% better results than pre-training on high-quality computer graphics renderings.
https://arxiv.org/abs/2403.14539
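A schematic version of the data-synthesis loop above; every callable is a placeholder rather than the ObjectDR implementation. The spatial condition (a rendered 2.5D sketch) pins down the silhouette while the prompt randomizes appearance and background.

```python
import random

def random_viewpoint():
    return {"azimuth": random.uniform(0, 360), "elevation": random.uniform(-10, 40)}

def synthesize_pairs(shapes, render_25d_sketch, conditional_generator,
                     appearance_prompts, background_prompts, n_views=4):
    pairs = []
    for shape in shapes:
        for _ in range(n_views):
            sketch = render_25d_sketch(shape, viewpoint=random_viewpoint())
            prompt = f"{random.choice(appearance_prompts)}, {random.choice(background_prompts)}"
            # The spatial condition (2.5D sketch) preserves the object silhouette while
            # the prompt randomizes texture and background, as in domain randomization.
            image = conditional_generator(condition=sketch, prompt=prompt)
            pairs.append((shape, image))   # <3D shape, 2D image> training pair
    return pairs
```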
Generating realistic 3D scenes is challenging due to the complexity of room layouts and object geometries. We propose a sketch-based knowledge-enhanced diffusion architecture (SEK) for generating customized, diverse, and plausible 3D scenes. SEK conditions the denoising process with a hand-drawn sketch of the target scene and cues from an object relationship knowledge base. We first construct an external knowledge base containing object relationships and then leverage knowledge-enhanced graph reasoning to assist our model in understanding hand-drawn sketches. A scene is represented as a combination of 3D objects and their relationships, and then incrementally diffused to reach a Gaussian distribution. We propose a 3D denoising scene transformer that learns to reverse the diffusion process, conditioned by a hand-drawn sketch along with knowledge cues, to regressively generate the scene including the 3D object instances as well as their layout. Experiments on the 3D-FRONT dataset show that our model improves FID, CKL by 17.41%, 37.18% in 3D scene generation and FID, KID by 19.12%, 20.06% in 3D scene completion compared to the nearest competitor DiffuScene.
https://arxiv.org/abs/2403.14121
This paper introduces Spatial Diagrammatic Instructions (SDIs), an approach for human operators to specify objectives and constraints that relate to spatial regions in the working environment. Operators sketch regions directly on camera images that correspond to the objectives and constraints. These sketches are projected to 3D spatial coordinates, and continuous Spatial Instruction Maps (SIMs) are learned from them. The maps can then be integrated into optimization problems for robot tasks. In particular, we demonstrate how Spatial Diagrammatic Instructions can be applied to the Base Placement Problem of mobile manipulators, which concerns where to place the manipulator so as to best facilitate a given task. Human operators specify, via sketch, the spatial regions of interest for a manipulation task and the permissible regions for the mobile manipulator to occupy. We then solve an optimization problem that maximizes the manipulator's reachability, or coverage, over the designated regions of interest while keeping it within the permissible regions. We provide extensive empirical evaluations and show that our formulation of Spatial Instruction Maps provides accurate representations of user-specified diagrammatic instructions. Furthermore, we demonstrate that our diagrammatic approach to the Base Placement Problem yields higher-quality solutions and faster run-times.
https://arxiv.org/abs/2403.12465
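A toy numpy sketch of the base-placement formulation above, assuming the sketched regions have already been projected to 2D ground-plane coordinates; the Gaussian-kernel instruction map, the reach model, and the grid search are simplifications, not the paper's method.

```python
import numpy as np

def instruction_map(points, bandwidth=0.3):
    """Continuous map m(x) built from sketched points via a Gaussian kernel."""
    def m(query):
        d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2)).mean(-1)
    return m

def best_base_pose(interest_pts, permissible_pts, reach=0.8, grid_res=0.1):
    roi = instruction_map(interest_pts)
    xs = np.arange(permissible_pts[:, 0].min(), permissible_pts[:, 0].max(), grid_res)
    ys = np.arange(permissible_pts[:, 1].min(), permissible_pts[:, 1].max(), grid_res)
    best, best_score = None, -np.inf
    for x in xs:
        for y in ys:
            base = np.array([x, y])
            # Coverage: instruction-map mass of interest points within the reach radius.
            within = interest_pts[np.linalg.norm(interest_pts - base, axis=1) < reach]
            score = roi(within).sum() if len(within) else 0.0
            if score > best_score:
                best, best_score = base, score
    return best, best_score

interest = np.random.rand(50, 2) + np.array([1.0, 0.0])    # sketched task region (toy data)
permissible = np.random.rand(100, 2)                        # where the base may stand (toy data)
print(best_base_pose(interest, permissible))
```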
In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like ControlNet for Sketch2Photo and Edge2Image, but with single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at this https URL.
https://arxiv.org/abs/2403.12036
Facial sketches are both a concise way of showing the identity of a person and a means to express artistic intention. While a few techniques have recently emerged that allow sketches to be extracted in different styles, they typically rely on a large amount of data that is difficult to obtain. Here, we propose StyleSketch, a method for extracting high-resolution stylized sketches from a face image. Using the rich semantics of the deep features from a pretrained StyleGAN, we are able to train a sketch generator with 16 pairs of face images and corresponding sketches. The sketch generator utilizes part-based losses with two-stage learning for fast convergence during training and high-quality sketch extraction. Through a set of comparisons, we show that StyleSketch outperforms existing state-of-the-art sketch extraction methods and few-shot image adaptation methods for the task of extracting high-resolution abstract face sketches. We further demonstrate the versatility of StyleSketch by extending its use to other domains and explore the possibility of semantic editing. The project page can be found at this https URL.
https://arxiv.org/abs/2403.11263
State-of-the-art KBQA models assume that questions are answerable. Recent research has shown that, while these models can be adapted to detect unanswerability with suitable training and thresholding, this comes at the expense of accuracy on answerable questions, and no single model is able to handle all categories of unanswerability. We propose a new model for KBQA named RetinaQA that is robust against unanswerability. It complements KB-traversal-based logical form retrieval with sketch-filling-based logical form construction. This helps with questions that have valid logical forms but no data paths in the KB leading to an answer. Additionally, it uses discrimination instead of generation to better identify questions that do not have valid logical forms. We demonstrate that RetinaQA significantly outperforms adaptations of state-of-the-art KBQA models across answerable and unanswerable questions, while showing robustness across unanswerability categories. Remarkably, it also establishes a new state of the art for answerable KBQA by surpassing existing models.
https://arxiv.org/abs/2403.10849
In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, like images or depth maps, or motion-based conditions, like moving sketches or object bounding boxes. Semantic inputs offer a rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as demonstrated in Fig 1. To this end, we introduce the Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions, promoting synergy between different modalities. For model training, we separate the conditions for the two modalities, introducing a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.
https://arxiv.org/abs/2403.10179
In this paper, we explore the unique modality of sketch for explainability, emphasising the profound impact of human strokes compared to conventional pixel-oriented studies. Beyond explanations of network behavior, we discern the genuine implications of explainability across diverse downstream sketch-related tasks. We propose a lightweight and portable explainability solution -- a seamless plugin that integrates effortlessly with any pre-trained model, eliminating the need for re-training. Demonstrating its adaptability, we present four applications: highly studied retrieval and generation, and completely novel assisted drawing and sketch adversarial attacks. The centrepiece of our solution is a stroke-level attribution map that takes different forms when linked with downstream tasks. By addressing the inherent non-differentiability of rasterisation, we enable explanations at both the coarse stroke level (SLA) and the partial stroke level (P-SLA), each with its advantages for specific downstream tasks.
https://arxiv.org/abs/2403.09480
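A simplified PyTorch illustration of a coarse stroke-level attribution map in the spirit described above: pixel-level gradients of a frozen model's score are pooled over each stroke's rasterized mask. The rasteriser and model are stand-ins, and the pooling rule is an assumption, not the paper's plugin.

```python
import torch

def stroke_level_attribution(strokes, rasterize, model, target_class):
    """strokes: list of stroke tensors; rasterize: strokes -> (1, 1, H, W) image tensor."""
    image = rasterize(strokes).requires_grad_(True)
    score = model(image)[0, target_class]            # class logit of a frozen, pre-trained model
    score.backward()
    pixel_attr = (image.grad * image).detach().abs().squeeze()   # (H, W) pixel saliency
    attributions = []
    for s in strokes:
        mask = rasterize([s]).detach().squeeze() > 0  # pixels covered by this stroke alone
        attributions.append(pixel_attr[mask].sum().item())
    return attributions                               # one attribution value per stroke
```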
We propose SketchINR, to advance the representation of vector sketches with implicit neural models. A variable length vector sketch is compressed into a latent space of fixed dimension that implicitly encodes the underlying shape as a function of time and strokes. The learned function predicts the $xy$ point coordinates in a sketch at each time and stroke. Despite its simplicity, SketchINR outperforms existing representations at multiple tasks: (i) Encoding an entire sketch dataset into a fixed size latent vector, SketchINR gives $60\times$ and $10\times$ data compression over raster and vector sketches, respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity representation than other learned vector sketch representations, and is uniquely able to scale to complex vector sketches such as FS-COCO. (iii) SketchINR supports parallelisation that can decode/render $\sim$$100\times$ faster than other learned vector representations such as SketchRNN. (iv) SketchINR, for the first time, emulates the human ability to reproduce a sketch with varying abstraction in terms of number and complexity of strokes. As a first look at implicit sketches, SketchINR's compact high-fidelity representation will support future work in modelling long and complex sketches.
https://arxiv.org/abs/2403.09344
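A minimal PyTorch sketch of the implicit formulation above: a fixed-size latent code plus (time, stroke) inputs are decoded by an MLP into xy coordinates, fitted auto-decoder style. Layer sizes and the input encoding are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ImplicitSketch(nn.Module):
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # predicted (x, y)
        )

    def forward(self, z, t, stroke):
        # z: (B, latent_dim) sketch code; t, stroke: (B,) scalars in [0, 1].
        inp = torch.cat([z, t.unsqueeze(1), stroke.unsqueeze(1)], dim=1)
        return self.mlp(inp)

# Auto-decoder style fitting: the latent code itself is optimised per sketch.
model = ImplicitSketch()
z = torch.zeros(1, 256, requires_grad=True)
t = torch.linspace(0, 1, 100)                     # query times along one stroke
pred_xy = model(z.expand(100, -1), t, torch.zeros(100))
print(pred_xy.shape)                              # torch.Size([100, 2])
```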
Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.
https://arxiv.org/abs/2403.09029
Drawing is an art form that enables people to express their imagination and emotions. However, individuals often face challenges when drawing, especially in translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista, a novel system integrating AR and generative AI technologies. ARtVista not only recommends reference images aligned with users' abstract ideas and generates sketches for them to draw, but also goes further, crafting vibrant paintings in various styles. ARtVista additionally offers an alternative way to create striking paintings by simulating the paint-by-number concept on reference images, freeing users from the need for advanced drawing skills. A pilot study reveals positive feedback on its usability, emphasizing its effectiveness in visualizing user ideas and aiding the painting process to achieve stunning pictures without requiring advanced drawing skills. The source code will be available at this https URL.
https://arxiv.org/abs/2403.08876
In the realm of fashion design, sketches serve as the canvas for expressing an artist's distinctive drawing style and creative vision, capturing intricate details like stroke variations and texture nuances. The advent of sketch-to-image cross-modal translation technology has notably aided designers. However, existing methods often compromise these sketch details during image generation, resulting in images that deviate from the designer's intended concept. This limitation hampers the ability to offer designers a precise preview of the final output. To overcome this challenge, we introduce HAIFIT, a novel approach that transforms sketches into high-fidelity, lifelike clothing images by integrating multi-scale features and capturing extensive feature map dependencies from diverse perspectives. Through extensive qualitative and quantitative evaluations conducted on our self-collected dataset, our method demonstrates superior performance compared to existing methods in generating photorealistic clothing images. Our method excels in preserving the distinctive style and intricate details essential for fashion design applications.
https://arxiv.org/abs/2403.08651
While manga is a popular form of entertainment, creating manga is tedious, especially adding screentones to the drawn sketch, namely manga screening. Unfortunately, no existing method is tailored for automatic manga screening, probably due to the difficulty of generating high-quality shaded high-frequency screentones. Classic manga screening approaches generally require user input to provide screentone exemplars or a reference manga image. Recent deep learning models enable automatic generation by learning from large-scale datasets. However, state-of-the-art models still fail to generate high-quality shaded screentones due to the lack of a tailored model and high-quality manga training data. In this paper, we propose a novel sketch-to-manga framework that first generates a color illustration from the sketch and then generates a screentoned manga based on intensity guidance. Our method significantly outperforms existing methods in generating high-quality manga with shaded high-frequency screentones.
https://arxiv.org/abs/2403.08266
This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. Importantly, we democratise the process, enabling amateur sketches to generate precise images and living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial conditioning. To rectify this, we propose an abstraction-aware framework, utilising a sketch adapter, adaptive time-step sampling, and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple, rough sketch akin to what you and I can create suffices! We welcome everyone to examine the results presented in the paper and its supplementary material. Contributions include democratising sketch control, introducing an abstraction-aware framework, and leveraging discriminative guidance, validated through extensive experiments.
https://arxiv.org/abs/2403.07234
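The abstract mentions adaptive time-step sampling; as a loose illustration of the general idea (the Beta-shaped bias below is an assumption, not the paper's schedule), one can draw diffusion time steps non-uniformly during adapter fine-tuning, favouring noisier steps where the coarse structure conveyed by a rough sketch is decided.

```python
import torch

def sample_timesteps(batch_size, num_train_steps=1000, alpha=2.0, beta=1.0):
    """Beta(alpha, beta) with alpha > beta puts more sampling mass near t = num_train_steps."""
    u = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return (u * (num_train_steps - 1)).long()

print(sample_timesteps(8))   # e.g. tensor([743, 981, 512, ...]) -- biased toward large t
```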