During the early stages of interface design, designers need to produce multiple sketches to explore a design space. Design tools often fail to support this critical stage because they require specifying more detail than necessary. Although recent advances in generative AI have raised hopes of solving this issue, in practice they fall short because expressing loose ideas in a prompt is impractical. In this paper, we propose a diffusion-based approach to the low-effort generation of interface sketches. It breaks new ground by allowing flexible control of the generation process via three types of inputs: A) prompts, B) wireframes, and C) visual flows. The designer can provide any combination of these as input, at any level of detail, and will get a diverse gallery of low-fidelity solutions in response. The unique benefit is that large design spaces can be explored rapidly with very little input-specification effort. We present qualitative results for various combinations of input specifications and demonstrate that our model aligns more accurately with these specifications than other models.
https://arxiv.org/abs/2502.03330
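One generic way to let a single diffusion model accept any subset of the three inputs above is to learn a null embedding per modality and randomly drop conditions during training, as in classifier-free guidance. The sketch below illustrates that pattern only; the module names, dimensions, and dropout rate are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): a conditioner that accepts any subset
# of {prompt, wireframe, visual flow} embeddings by substituting learned null
# tokens for missing modalities and randomly dropping conditions in training.
import torch
import torch.nn as nn

class FlexibleConditioner(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Learned placeholders used whenever a modality is absent or dropped.
        self.null_prompt = nn.Parameter(torch.zeros(dim))
        self.null_wireframe = nn.Parameter(torch.zeros(dim))
        self.null_flow = nn.Parameter(torch.zeros(dim))
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, prompt=None, wireframe=None, flow=None, p_drop=0.1):
        def pick(emb, null):
            if emb is None or (self.training and torch.rand(()) < p_drop):
                return null.expand_as(emb) if emb is not None else null.unsqueeze(0)
            return emb
        parts = [pick(prompt, self.null_prompt),
                 pick(wireframe, self.null_wireframe),
                 pick(flow, self.null_flow)]
        # Broadcast null tokens to a common batch size before fusing.
        b = max(p.shape[0] for p in parts)
        parts = [p.expand(b, -1) for p in parts]
        return self.fuse(torch.cat(parts, dim=-1))

cond = FlexibleConditioner()
c = cond(prompt=torch.randn(4, 256))  # wireframe and visual flow omitted
print(c.shape)                        # torch.Size([4, 256])
```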
Low-rank tensor estimation offers a powerful approach to addressing high-dimensional data challenges and can substantially improve solutions to ill-posed inverse problems, such as image reconstruction under noisy or undersampled conditions. Meanwhile, tensor decomposition has gained prominence in federated learning (FL) due to its effectiveness in exploiting latent space structure and its capacity to enhance communication efficiency. In this paper, we present a federated image reconstruction method that applies Tucker decomposition, incorporating joint factorization and randomized sketching to manage large-scale, multimodal data. Our approach avoids reconstructing full-size tensors and supports heterogeneous ranks, allowing clients to select personalized decomposition ranks based on prior knowledge or communication capacity. Numerical results demonstrate that our method achieves superior reconstruction quality and communication compression compared to existing approaches, thereby highlighting its potential for multimodal inverse problems in the FL setting.
https://arxiv.org/abs/2502.02761
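To make the compression argument concrete, here is a minimal NumPy sketch of Tucker compression via truncated HOSVD with per-client ranks, showing that only a small core and factor matrices would need to be communicated instead of the full tensor. It is an illustration under assumed shapes and ranks, not the paper's federated algorithm with joint factorization and randomized sketching.

```python
# Minimal sketch, not the paper's method: Tucker compression via truncated
# HOSVD, with two "clients" choosing different (heterogeneous) ranks.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Mode-n product: contract M's columns with T's n-th axis."""
    moved = np.moveaxis(T, mode, 0)
    return np.moveaxis(np.tensordot(M, moved, axes=1), 0, mode)

def tucker_hosvd(T, ranks):
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_multiply(core, U.T, m)
    return core, factors

def reconstruct(core, factors):
    T = core
    for m, U in enumerate(factors):
        T = mode_multiply(T, U, m)
    return T

rng = np.random.default_rng(0)
# Stand-in data: a tensor with low multilinear rank plus a little noise.
G = rng.standard_normal((8, 8, 4))
A, B, C = (rng.standard_normal((32, 8)), rng.standard_normal((32, 8)),
           rng.standard_normal((16, 4)))
X = reconstruct(G, [A, B, C]) + 0.01 * rng.standard_normal((32, 32, 16))

for ranks in [(4, 4, 2), (8, 8, 4)]:   # two clients with personalized ranks
    core, factors = tucker_hosvd(X, ranks)
    n_vals = core.size + sum(U.size for U in factors)
    err = np.linalg.norm(X - reconstruct(core, factors)) / np.linalg.norm(X)
    print(ranks, f"values sent: {n_vals} vs full tensor {X.size}, rel. error {err:.3f}")
```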
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from slow inference. To address this, we propose a novel distillation technique based on the inverse bridge matching formulation and derive a tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional DBMs, distill models into a one-step generator, and use only corrupted images for training. We evaluate our approach for both conditional and unconditional bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique accelerates DBM inference by 4x to 100x and, depending on the particular setup, can even provide better generation quality than the teacher model.
https://arxiv.org/abs/2502.01362
With the advancement of generative artificial intelligence, previous studies have achieved the task of generating aesthetic images from hand-drawn sketches, fulfilling the public's needs for drawing. However, these methods are limited to static images and lack the ability to control video animation generation using hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional artists. Specifically, our method introduces a Level-Based Sketch Control Strategy to automatically adjust the guidance strength of sketches during the generation process, accommodating users with varying drawing skills. Furthermore, a TempSpatial Attention mechanism is designed to enhance the spatiotemporal consistency of generated video animations, significantly improving the coherence across frames. You can find more detailed cases on our official website.
https://arxiv.org/abs/2502.01101
The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring "Gray-box" fine-tuning approaches, where the model's architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model's input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment. Results show that our Gray-box approaches are competitive with full-access fine-tuning methods, despite having limited access to the model.
https://arxiv.org/abs/2502.00796
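The gray-box setup can be pictured with a frozen stand-in backbone whose parameters are never updated (only gradients propagate through it) and two small learnable modules at its input and output. The sketch below uses an assumed toy backbone and adapter sizes; it is not the paper's framework.

```python
# Minimal sketch, assuming a stand-in backbone: the "gray-box" setup keeps the
# backbone's weights frozen (only gradients flow through it) and trains two
# lightweight modules at its input and output. Module sizes are assumptions.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128))
for p in backbone.parameters():
    p.requires_grad_(False)          # backbone weights stay untouched

input_adapter = nn.Linear(64, 128)   # maps task inputs into the backbone's space
output_adapter = nn.Linear(128, 10)  # maps backbone features to the new task

opt = torch.optim.AdamW(
    list(input_adapter.parameters()) + list(output_adapter.parameters()), lr=1e-3)

x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
for step in range(3):
    # Gradients pass through the frozen backbone to reach the input adapter.
    logits = output_adapter(backbone(input_adapter(x)))
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, float(loss))
```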
Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.
https://arxiv.org/abs/2502.00307
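The central observation, that the domain transfer can happen at a single intermediate step, can be sketched structurally: noise the source image to a chosen timestep t with the standard DDPM forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, apply a translator once at that step, then run the ordinary reverse process in the target domain. The networks below are placeholders; the transfer timestep and noise schedule are assumptions.

```python
# Structural sketch with placeholder networks (not the paper's models): the
# translator is applied once, at a single intermediate DDPM timestep.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(x0, t):
    """q(x_t | x_0): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps

def translator(xt, t):
    # Placeholder for the learned domain translator applied at this one step.
    return xt

def denoise_from(xt, t):
    # Placeholder: a trained target-domain denoiser would iterate s = t..0 here.
    return xt

x0_source = torch.randn(1, 3, 64, 64)   # stand-in source-domain image
t_star = 400                            # assumed intermediate transfer step
xt = forward_noise(x0_source, t_star)   # push the source image to noise level t*
xt = translator(xt, t_star)             # domain transfer happens once, here
x0_target = denoise_from(xt, t_star)    # ordinary reverse diffusion afterwards
print(x0_target.shape)
```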
Sketches, with their expressive potential, allow humans to convey the essence of an object through even a rough contour. For the first time, we harness this expressive potential to improve segmentation performance in challenging tasks like camouflaged object detection (COD). Our approach introduces an innovative sketch-guided interactive segmentation framework, allowing users to intuitively annotate objects with freehand sketches (drawing a rough contour of the object) instead of the traditional bounding boxes or points used in classic interactive segmentation models like SAM. We demonstrate that sketch input can significantly improve performance in existing iterative segmentation methods, outperforming text or bounding box annotations. Additionally, we introduce key modifications to network architectures and a novel sketch augmentation technique to fully harness the power of sketch input and further boost segmentation accuracy. Remarkably, our model's output can be directly used to train other neural networks, achieving results comparable to pixel-by-pixel annotations while reducing annotation time by up to 120 times, which shows great potential for democratizing the annotation process and enabling model training with less reliance on resource-intensive, laborious pixel-level annotations. We also present KOSCamo+, the first freehand sketch dataset for camouflaged object detection. The dataset, code, and the labeling tool will be open sourced.
https://arxiv.org/abs/2501.19329
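The paper's sketch augmentation technique is not detailed in the abstract; as a hedged illustration of the general idea, the snippet below perturbs a clean object contour (random vertex dropping plus smoothed jitter) so that training sees rough, freehand-like inputs. The jitter scale and resampling density are arbitrary assumptions.

```python
# Generic contour augmentation sketch (not the paper's technique): roughen a
# clean outline so a model trains on freehand-like annotations.
import numpy as np

def augment_contour(points, jitter=0.02, keep_frac=0.7, seed=None):
    """points: (N, 2) array of x, y in [0, 1] tracing the object outline."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # 1) Randomly drop vertices to mimic a sparser hand-drawn stroke.
    keep = np.sort(rng.choice(n, size=max(3, int(keep_frac * n)), replace=False))
    pts = points[keep]
    # 2) Add positional noise, smoothed along the stroke so it stays wavy
    #    rather than jagged.
    noise = rng.normal(0.0, jitter, size=pts.shape)
    kernel = np.ones(5) / 5.0
    noise = np.stack([np.convolve(noise[:, d], kernel, mode="same") for d in range(2)],
                     axis=1)
    return np.clip(pts + noise, 0.0, 1.0)

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = 0.5 + 0.3 * np.stack([np.cos(theta), np.sin(theta)], axis=1)  # toy "object"
rough = augment_contour(circle, seed=0)
print(circle.shape, rough.shape)
```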
With recent advancements in the capabilities of Text-to-Image (T2I) AI models, product designers have begun experimenting with them in their work. However, T2I models struggle to interpret abstract language and the current user experience of T2I tools can induce design fixation rather than a more iterative, exploratory process. To address these challenges, we developed Inkspire, a sketch-driven tool that supports designers in prototyping product design concepts with analogical inspirations and a complete sketch-to-design-to-sketch feedback loop. To inform the design of Inkspire, we conducted an exchange session with designers and distilled design goals for improving T2I interactions. In a within-subjects study comparing Inkspire to ControlNet, we found that Inkspire supported designers with more inspiration and exploration of design ideas, and improved aspects of the co-creative process by allowing designers to effectively grasp the current state of the AI to guide it towards novel design intentions.
https://arxiv.org/abs/2501.18588
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
https://arxiv.org/abs/2501.17315
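As a hedged, illustrative example of claim (3): if the red team's model never exfiltrated data in n proxy attempts, the rule of three gives roughly 3/n as an approximate 95% upper bound on the per-attempt success probability, and, assuming independent attempts, the deployment-level risk over N opportunities is at most 1 - (1 - p)^N. The numbers below are assumptions, not figures from the paper.

```python
# A hedged, illustrative calculation (not from the paper): with zero observed
# exfiltrations in n red-team attempts, the "rule of three" gives roughly
# 3 / n as an approximate 95% upper bound on the per-attempt success
# probability. Assuming attempts are independent, the chance of at least one
# success across N deployment opportunities is then at most 1 - (1 - p)^N.
n_eval_attempts = 100_000      # assumed size of the control evaluation
n_deploy_attempts = 1_000      # assumed opportunities during deployment

p_upper = 3.0 / n_eval_attempts
risk_upper = 1.0 - (1.0 - p_upper) ** n_deploy_attempts

print(f"per-attempt upper bound: {p_upper:.2e}")
print(f"deployment risk bound over {n_deploy_attempts} attempts: {risk_upper:.3f}")
# Note: the bound is only informative when the evaluation is large relative to
# the number of deployment opportunities; otherwise it approaches 1.
```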
Cooperative perception offers an optimal solution to overcome the perception limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X) communication for data sharing and fusion across multiple agents. However, most existing approaches focus on single-modality data exchange, limiting the potential of both homogeneous and heterogeneous fusion across agents. This overlooks the opportunity to utilize multi-modality data per agent, restricting the system's performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To harness the potential of every possible data source for optimal performance, we design a robust LiDAR and camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent and inter-agent cross-modality fusion scenarios, owing to the convenient coordinate conversion by transformation matrix and the unified sampling/inversion mechanism. We also propose two different architectures, named Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception. PTP aims for maximum precision and achieves a smaller data packet size by limiting cross-agent fusion to a single instance, but requires all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration: LiDAR-only, camera-only, or both, offering greater generalization ability. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code will be released at GitHub in early 2025.
https://arxiv.org/abs/2501.16803
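A generic cross-attention block conveys the flavour of cross-modality fusion: camera tokens query LiDAR tokens after both are projected into a shared feature space. This is a simplified stand-in, not RG-Attn; token counts and dimensions are assumptions.

```python
# Minimal cross-modality fusion sketch (a generic cross-attention block, not
# RG-Attn itself): camera tokens attend to LiDAR tokens in a shared space.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, cam_dim=96, lidar_dim=64, dim=128, heads=4):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, dim)
        self.lidar_proj = nn.Linear(lidar_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_tokens, lidar_tokens):
        q = self.cam_proj(cam_tokens)            # (B, Nc, dim)
        kv = self.lidar_proj(lidar_tokens)       # (B, Nl, dim)
        fused, _ = self.attn(q, kv, kv)          # camera queries attend to LiDAR
        return self.norm(q + fused)              # residual + norm

fusion = CrossModalFusion()
cam = torch.randn(2, 200, 96)     # e.g. image-plane feature tokens
lidar = torch.randn(2, 500, 64)   # e.g. BEV/voxel feature tokens
print(fusion(cam, lidar).shape)   # torch.Size([2, 200, 128])
```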
Creating hand-drawn animation sequences is labor-intensive and demands professional expertise. We introduce PhysAnimator, a novel approach for generating physically plausible, anime-stylized animation from static anime illustrations. Our method seamlessly integrates physics-based simulations with data-driven generative models to produce dynamic and visually compelling animations. To capture the fluidity and exaggeration characteristic of anime, we perform image-space deformable body simulations on extracted mesh geometries. We enhance artistic control by introducing customizable energy strokes and incorporating rigging point support, enabling the creation of tailored animation effects such as wind interactions. Finally, we extract and warp sketches from the simulation sequence, generating a texture-agnostic representation, and employ a sketch-guided video diffusion model to synthesize high-quality animation frames. The resulting animations exhibit temporal consistency and visual plausibility, demonstrating the effectiveness of our method in creating dynamic anime-style animations.
https://arxiv.org/abs/2501.16550
In this paper, we expand the domain of sketch research into the field of image segmentation, aiming to establish freehand sketches as a query modality for subjective image segmentation. Our innovative approach introduces a "sketch-in-the-loop" image segmentation framework, enabling the segmentation of visual concepts partially, completely, or in groupings - a truly "freestyle" approach - without the need for a purpose-made dataset (i.e., mask-free). This framework capitalises on the synergy between sketch-based image retrieval (SBIR) models and large-scale pre-trained models (CLIP or DINOv2). The former provides an effective training signal, while fine-tuned versions of the latter execute the subjective segmentation. Additionally, our purpose-made augmentation strategy enhances the versatility of our sketch-guided mask generation, allowing segmentation at multiple granularity levels. Extensive evaluations across diverse benchmark datasets underscore the superior performance of our method in comparison to existing approaches across various evaluation scenarios.
https://arxiv.org/abs/2501.16022
Reinforcement learning (RL) has demonstrated success in automating insulin dosing in simulated type 1 diabetes (T1D) patients but is currently unable to incorporate patient expertise and preference. This work introduces PAINT (Preference Adaptation for INsulin control in T1D), an original RL framework for learning flexible insulin dosing policies from patient records. PAINT employs a sketch-based approach for reward learning, where past data is annotated with a continuous reward signal to reflect the patient's desired outcomes. The labelled data trains a reward model, which informs the actions of a novel safety-constrained offline RL algorithm designed to restrict actions to a safe strategy and enable preference tuning via a sliding scale. In-silico evaluation shows PAINT achieves common glucose goals through simple labelling of desired states, reducing glycaemic risk by 15% over a commercial benchmark. Action labelling can also be used to incorporate patient expertise, demonstrating an ability to pre-empt meals (+10% time-in-range post-meal) and address certain device errors (-1.6% variance post-error) with patient guidance. These results hold under realistic conditions, including limited samples, labelling errors, and intra-patient variability. This work illustrates PAINT's potential in real-world T1D management and, more broadly, in any task requiring rapid and precise preference learning under safety constraints.
https://arxiv.org/abs/2501.15972
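The two ingredients, a reward model fit to continuously labelled records and a safety-constrained, preference-weighted action choice, can be sketched with toy data. Everything below (the features, the labelling rule, the dose grid, and the one-line dynamics model) is an assumption for illustration, not PAINT itself.

```python
# Toy sketch of the two ingredients described above (not PAINT): (1) fit a
# reward model to continuously labelled states, (2) choose insulin doses only
# from a restricted safe set, with a scalar preference weight.
import numpy as np

rng = np.random.default_rng(0)

# (1) Reward learning: states labelled with a continuous "desirability" signal.
states = rng.normal(size=(500, 4))      # assumed features: glucose deviation, trend, IOB, meal flag
labels = -np.abs(states[:, 0])          # toy label: prefer glucose near target (feature 0 = 0)
w, *_ = np.linalg.lstsq(states, labels, rcond=None)   # linear reward model r(s) = s @ w

def reward(state):
    return state @ w

# (2) Safety-constrained action choice with a preference "sliding scale".
safe_doses = np.linspace(0.0, 2.0, 21)  # assumed safe dose grid (units)
preference = 0.5                        # 0 = cautious, 1 = aggressive

def choose_dose(state):
    # Toy next-state model: a dose lowers the glucose-deviation feature proportionally.
    candidates = [state - np.array([d, 0.0, 0.0, 0.0]) for d in safe_doses]
    scores = [reward(s) - (1.0 - preference) * d for s, d in zip(candidates, safe_doses)]
    return safe_doses[int(np.argmax(scores))]

print(choose_dose(np.array([1.5, 0.1, 0.0, 0.0])))
```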
3D Gaussian Splatting (3DGS) has emerged as a promising representation for photorealistic rendering of 3D scenes. However, its high storage requirements pose significant challenges for practical applications. We observe that Gaussians exhibit distinct roles and characteristics that are analogous to traditional artistic techniques: just as artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features such as edges and contours, while others represent broader, smoother regions, analogous to the wider brush strokes that add volume and depth to a painting. Based on this observation, we propose a novel hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which define scene boundaries, and (ii) Patch Gaussians, which cover smooth regions. Sketch Gaussians are efficiently encoded using parametric models, leveraging their geometric coherence, while Patch Gaussians undergo optimized pruning, retraining, and vector quantization to maintain volumetric consistency and storage efficiency. Our comprehensive evaluation across diverse indoor and outdoor scenes demonstrates that this structure-aware approach achieves up to 32.62% improvement in PSNR, 19.12% in SSIM, and 45.41% in LPIPS at equivalent model sizes; for an indoor scene, our model maintains visual quality at 2.3% of the original model size.
https://arxiv.org/abs/2501.13045
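A hedged illustration of the split-and-compress idea: score each Gaussian with a simple anisotropy proxy, keep the most elongated ones as "sketch" Gaussians, and vector-quantize the attributes of the remaining "patch" Gaussians with a tiny k-means codebook. The scoring rule, threshold, and codebook size are assumptions, not the paper's pipeline.

```python
# Hedged illustration (not the paper's pipeline): split Gaussians by an
# anisotropy proxy, then vector-quantize only the "patch" group's attributes.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
scales = np.exp(rng.normal(size=(n, 3)))   # per-Gaussian 3D scales (stand-in data)
colors = rng.uniform(size=(n, 3))          # per-Gaussian colour attributes (stand-in)

anisotropy = scales.max(axis=1) / scales.min(axis=1)
is_sketch = anisotropy > np.quantile(anisotropy, 0.8)   # top 20% -> "sketch" Gaussians

def kmeans(X, k=64, iters=20, seed=0):
    rs = np.random.default_rng(seed)
    centers = X[rs.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

codebook, codes = kmeans(colors[~is_sketch])   # compress only the patch group
stored = is_sketch.sum() * 3 + codebook.size + codes.size  # rough value count
print(f"sketch: {is_sketch.sum()}, patch: {(~is_sketch).sum()}, "
      f"stored values ~{stored} vs {colors.size}")
```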
Despite advancements in cross-domain image translation, challenges persist in asymmetric tasks such as SAR-to-Optical and Sketch-to-Instance conversions, which involve transforming data from a less detailed domain into one with richer content. Traditional CNN-based methods are effective at capturing fine details but struggle with global structure, leading to unwanted merging of image regions. To address this, we propose the CNN-Swin Hybrid Network (CSHNet), which combines two key modules: Swin Embedded CNN (SEC) and CNN Embedded Swin (CES), forming the SEC-CES-Bottleneck (SCB). SEC leverages CNN's detailed feature extraction while integrating the Swin Transformer's structural bias. CES, in turn, preserves the Swin Transformer's global integrity, compensating for CNN's lack of focus on structure. Additionally, CSHNet includes two components designed to enhance cross-domain information retention: the Interactive Guided Connection (IGC), which enables dynamic information exchange between SEC and CES, and Adaptive Edge Perception Loss (AEPL), which maintains structural boundaries during translation. Experimental results show that CSHNet outperforms existing methods in both visual quality and performance metrics across scene-level and instance-level datasets. Our code is available at: this https URL.
https://arxiv.org/abs/2501.10197
Animated video separates foreground and background elements into layers, with distinct processes for sketching, refining, coloring, and in-betweening. Existing video generation methods typically treat animation as a monolithic data domain, lacking fine-grained control over individual layers. In this paper, we introduce LayerAnimate, a novel architectural approach that enhances fine-grained control over individual animation layers within a video diffusion model, allowing users to independently manipulate foreground and background elements in distinct layers. To address the challenge of limited layer-specific data, we propose a data curation pipeline that features automated element segmentation, motion-state hierarchical merging, and motion coherence refinement. Through quantitative and qualitative comparisons, and user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an ideal tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-specific animation applications and creative flexibility. Our code is available at this https URL.
https://arxiv.org/abs/2501.08295
From Paleolithic cave paintings to Impressionism, human painting has evolved to depict increasingly complex and detailed scenes, conveying more nuanced messages. This paper attempts to elicit this artistic capability by simulating the evolutionary pressures that enhance visual communication efficiency. Specifically, we present a model with a stroke branch and a palette branch that together simulate human-like painting. The palette branch learns a limited colour palette, while the stroke branch parameterises each stroke using Bézier curves to render an image, which is subsequently evaluated by a high-level recognition module. We quantify the efficiency of visual communication by measuring the recognition accuracy achieved with machine vision. The model then optimises the control points and colour choices for each stroke to maximise recognition accuracy with minimal strokes and colours. Experimental results show that our model achieves superior performance in high-level recognition tasks, delivering artistic expression and aesthetic appeal, especially in abstract sketches. Additionally, our approach shows promise as an efficient bit-level image compression technique, outperforming traditional methods.
https://arxiv.org/abs/2501.04966
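The stroke parameterisation can be made concrete with a minimal renderer: evaluate a cubic Bézier curve from four control points and stamp it onto a canvas using one colour from a small fixed palette. Canvas size, palette, and stroke thickness below are assumptions; the paper's differentiable rendering and recognition-driven optimisation are not reproduced.

```python
# Minimal stroke-rendering sketch (not the paper's renderer): a cubic Bezier
# stroke drawn with one colour from a small, fixed palette.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=200):
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def draw_stroke(canvas, ctrl, color, radius=1):
    h, w, _ = canvas.shape
    pts = cubic_bezier(*ctrl)                        # (n, 2) points in [0, 1]^2
    xy = np.clip((pts * [w - 1, h - 1]).round().astype(int), 0, [w - 1, h - 1])
    for x, y in xy:
        canvas[max(0, y - radius):y + radius + 1,
               max(0, x - radius):x + radius + 1] = color
    return canvas

palette = np.array([[0, 0, 0], [200, 30, 30], [30, 30, 200]], dtype=np.uint8)  # limited palette
canvas = np.full((128, 128, 3), 255, dtype=np.uint8)
ctrl = [np.array(p) for p in [(0.1, 0.8), (0.3, 0.1), (0.7, 0.1), (0.9, 0.8)]]
canvas = draw_stroke(canvas, ctrl, palette[0])
print(canvas.shape, (canvas < 255).any())
```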
Controlling text-to-speech (TTS) systems to synthesize speech with the prosodic characteristics expected by users has attracted much attention. To achieve controllability, current studies focus on two main directions: (1) using reference speech as prosody prompt to guide speech synthesis, and (2) using natural language descriptions to control the generation process. However, finding reference speech that exactly contains the prosody that users want to synthesize takes a lot of effort. Description-based guidance in TTS systems can only determine the overall prosody, which has difficulty in achieving fine-grained prosody control over the synthesized speech. In this paper, we propose DrawSpeech, a sketch-conditioned diffusion model capable of generating speech based on any prosody sketches drawn by users. Specifically, the prosody sketches are fed to DrawSpeech to provide a rough indication of the expected prosody trends. DrawSpeech then recovers the detailed pitch and energy contours based on the coarse sketches and synthesizes the desired speech. Experimental results show that DrawSpeech can generate speech with a wide variety of prosody and can precisely control the fine-grained prosody in a user-friendly manner. Our implementation and audio samples are publicly available.
https://arxiv.org/abs/2501.04256
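As a rough picture of going from a coarse user sketch to a detailed contour, the snippet below interpolates a few drawn pitch anchors into a smooth frame-level F0 curve; plain interpolation stands in for DrawSpeech's learned diffusion refinement, and the frame count, pitch values, and smoothing window are assumptions.

```python
# Minimal illustration (plain interpolation standing in for the learned
# refinement): a few user-drawn pitch anchors become a frame-level F0 contour.
import numpy as np

n_frames = 200
sketch_x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])            # normalised time of sketch points
sketch_f0 = np.array([120.0, 180.0, 150.0, 220.0, 110.0])   # rough pitch values in Hz

frames = np.linspace(0.0, 1.0, n_frames)
f0 = np.interp(frames, sketch_x, sketch_f0)                 # piecewise-linear contour

# Light smoothing so the contour has no sharp corners.
kernel = np.hanning(15)
kernel /= kernel.sum()
f0_smooth = np.convolve(np.pad(f0, 7, mode="edge"), kernel, mode="valid")

# Toy energy contour derived from pitch, just to show both outputs' shapes.
energy = 0.5 + 0.5 * (f0_smooth - f0_smooth.min()) / np.ptp(f0_smooth)
print(f0_smooth.shape, energy.shape)
```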
Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.
https://arxiv.org/abs/2501.02527
Current sketch extraction methods either require extensive training or fail to capture a wide range of artistic styles, limiting their practical applicability and versatility. We introduce Mixture-of-Self-Attention (MixSA), a training-free sketch extraction method that leverages strong diffusion priors for enhanced sketch perception. At its core, MixSA employs a mixture-of-self-attention technique, which manipulates self-attention layers by substituting the keys and values with those from reference sketches. This allows for the seamless integration of brushstroke elements into initial outline images, offering precise control over texture density and enabling interpolation between styles to create novel, unseen styles. By aligning brushstroke styles with the texture and contours of colored images, particularly in late decoder layers handling local textures, MixSA addresses the common issue of color averaging by adjusting initial outlines. Evaluated with various perceptual metrics, MixSA demonstrates superior performance in sketch quality, flexibility, and applicability. This approach not only overcomes the limitations of existing methods but also empowers users to generate diverse, high-fidelity sketches that more accurately reflect a wide range of artistic expressions.
https://arxiv.org/abs/2501.00816
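The core manipulation, substituting the keys and values of a self-attention layer with those computed from reference features, can be sketched as generic attention code: queries come from the content tokens, keys and values from the reference tokens. The blend parameter below is a simplified stand-in for the paper's style interpolation, and all sizes are assumptions.

```python
# Generic attention sketch of the key/value substitution (not the paper's
# diffusion pipeline): Q from the content image, K/V from a reference sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwappedSelfAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, content_tokens, reference_tokens, mix=1.0):
        # mix = 0 keeps ordinary self-attention; mix = 1 takes keys/values
        # entirely from the reference. The linear blend is a simplified
        # stand-in for the style interpolation described in the abstract.
        kv_src = mix * reference_tokens + (1.0 - mix) * content_tokens
        b, n, d = content_tokens.shape
        h = self.heads
        q = self.to_q(content_tokens).view(b, n, h, d // h).transpose(1, 2)
        k = self.to_k(kv_src).view(b, -1, h, d // h).transpose(1, 2)
        v = self.to_v(kv_src).view(b, -1, h, d // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

attn = SwappedSelfAttention()
content = torch.randn(1, 64, 320)     # tokens from the image being stylised
reference = torch.randn(1, 64, 320)   # tokens from a reference sketch
print(attn(content, reference, mix=1.0).shape)  # torch.Size([1, 64, 320])
```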