From Paleolithic cave paintings to Impressionism, human painting has evolved to depict increasingly complex and detailed scenes, conveying more nuanced messages. This paper attempts to recreate this artistic capability by simulating the evolutionary pressures that enhance visual communication efficiency. Specifically, we present a model with a stroke branch and a palette branch that together simulate human-like painting. The palette branch learns a limited colour palette, while the stroke branch parameterises each stroke using Bézier curves to render an image, which is subsequently evaluated by a high-level recognition module. We quantify the efficiency of visual communication by measuring the recognition accuracy achieved with machine vision. The model then optimises the control points and colour choices for each stroke to maximise recognition accuracy with minimal strokes and colours. Experimental results show that our model achieves superior performance in high-level recognition tasks, delivering artistic expression and aesthetic appeal, especially in abstract sketches. Additionally, our approach shows promise as an efficient bit-level image compression technique, outperforming traditional methods.
https://arxiv.org/abs/2501.04966
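To make the stroke parameterization above concrete, here is a minimal PyTorch sketch of cubic Bézier strokes and the accuracy-versus-budget objective the abstract describes; the `recognizer` module, tensor shapes, and penalty weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): cubic Bezier stroke parameterization and
# a recognition loss traded off against stroke/colour budget, as described above.
import torch

def sample_cubic_bezier(ctrl, n=32):
    """ctrl: (S, 4, 2) control points for S strokes -> (S, n, 2) sampled points."""
    t = torch.linspace(0.0, 1.0, n, device=ctrl.device).view(1, n, 1)
    p0, p1, p2, p3 = [ctrl[:, i:i + 1, :] for i in range(4)]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def paint_loss(canvas, label, recognizer, n_strokes, n_colours,
               lambda_s=1e-3, lambda_c=1e-2):
    """Recognition loss plus penalties on the stroke and colour budget (weights assumed)."""
    logits = recognizer(canvas)  # frozen high-level recognition module
    ce = torch.nn.functional.cross_entropy(logits, label)
    return ce + lambda_s * n_strokes + lambda_c * n_colours
```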
Controlling text-to-speech (TTS) systems to synthesize speech with the prosodic characteristics expected by users has attracted much attention. To achieve controllability, current studies focus on two main directions: (1) using reference speech as a prosody prompt to guide speech synthesis, and (2) using natural language descriptions to control the generation process. However, finding reference speech that contains exactly the prosody users want to synthesize takes considerable effort, and description-based guidance can only determine the overall prosody, making fine-grained prosody control over the synthesized speech difficult. In this paper, we propose DrawSpeech, a sketch-conditioned diffusion model capable of generating speech based on any prosody sketch drawn by users. Specifically, the prosody sketches are fed to DrawSpeech to provide a rough indication of the expected prosody trends. DrawSpeech then recovers the detailed pitch and energy contours based on the coarse sketches and synthesizes the desired speech. Experimental results show that DrawSpeech can generate speech with a wide variety of prosody and can precisely control the fine-grained prosody in a user-friendly manner. Our implementation and audio samples are publicly available.
https://arxiv.org/abs/2501.04256
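As a rough illustration of how a coarse user sketch could serve as the conditioning signal described above, the snippet below upsamples a handful of drawn values to frame-level contours; this is a hedged sketch of the general idea, not DrawSpeech's actual preprocessing.

```python
# Assumed illustration: expand a coarse, user-drawn prosody sketch into a
# frame-level pitch curve that a diffusion model could then refine.
import numpy as np

def sketch_to_contour(sketch_points, n_frames):
    """sketch_points: 1-D array of a few user-drawn values (e.g. 5-20 samples).
    Returns a frame-aligned coarse contour of length n_frames."""
    src = np.linspace(0.0, 1.0, len(sketch_points))
    dst = np.linspace(0.0, 1.0, n_frames)
    return np.interp(dst, src, sketch_points)

pitch_sketch = np.array([110, 150, 180, 140, 120], dtype=float)  # rough F0 trend in Hz
coarse_pitch = sketch_to_contour(pitch_sketch, n_frames=400)     # conditioning signal
```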
Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.
https://arxiv.org/abs/2501.02527
Current sketch extraction methods either require extensive training or fail to capture a wide range of artistic styles, limiting their practical applicability and versatility. We introduce Mixture-of-Self-Attention (MixSA), a training-free sketch extraction method that leverages strong diffusion priors for enhanced sketch perception. At its core, MixSA employs a mixture-of-self-attention technique, which manipulates self-attention layers by substituting the keys and values with those from reference sketches. This allows for the seamless integration of brushstroke elements into initial outline images, offering precise control over texture density and enabling interpolation between styles to create novel, unseen styles. By aligning brushstroke styles with the texture and contours of colored images, particularly in late decoder layers handling local textures, MixSA addresses the common issue of color averaging by adjusting initial outlines. Evaluated with various perceptual metrics, MixSA demonstrates superior performance in sketch quality, flexibility, and applicability. This approach not only overcomes the limitations of existing methods but also empowers users to generate diverse, high-fidelity sketches that more accurately reflect a wide range of artistic expressions.
https://arxiv.org/abs/2501.00816
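The core key/value substitution is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of self-attention whose keys and values come from a reference sketch, plus a simple interpolation between two reference styles; shapes and the blending scheme are assumptions rather than the paper's exact layers.

```python
# Assumed sketch of the key/value substitution idea behind MixSA.
import torch

def mixed_self_attention(q, k_ref, v_ref):
    """q: (B, N, D) queries from the image being processed.
    k_ref, v_ref: (B, M, D) keys/values taken from a reference sketch,
    substituted for the layer's own keys/values."""
    attn = torch.softmax(q @ k_ref.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v_ref

def interpolate_styles(kv_a, kv_b, w=0.5):
    """Blend the (k, v) pairs of two reference styles to create an unseen style."""
    return tuple(w * a + (1 - w) * b for a, b in zip(kv_a, kv_b))
```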
Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain and costly to store. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle with inferring accurate 3D spatial location and orientation, leading to inaccuracies in determining the spatial 3D starting points and extrusion directions for constructing geometries. This work introduces CAD-GPT, a CAD synthesis method built on a spatial-reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This method maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, while discretizing 2D sketch coordinates into an appropriate planar space to enable precise determination of spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.
https://arxiv.org/abs/2412.19663
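To illustrate the kind of discretization the abstract alludes to, here is a small, purely illustrative example of mapping continuous 3D positions and sketch-plane angles to a 1D token vocabulary; the bin counts, token format, and value ranges are assumptions, not CAD-GPT's actual tokenizer.

```python
# Assumed illustration of discretising continuous geometry into a 1-D vocabulary
# that a multimodal LLM can emit as tokens.
import numpy as np

def quantize(value, lo, hi, n_bins):
    """Map a continuous value in [lo, hi] to an integer bin id in [0, n_bins)."""
    idx = int(np.floor((value - lo) / (hi - lo) * n_bins))
    return min(max(idx, 0), n_bins - 1)

def position_to_tokens(xyz, n_bins=64, extent=1.0):
    """Each coordinate becomes one token, e.g. '<x_37>'."""
    return [f"<{axis}_{quantize(c, -extent, extent, n_bins)}>"
            for axis, c in zip("xyz", xyz)]

def angle_to_token(theta_deg, n_bins=36):
    return f"<rot_{quantize(theta_deg, 0.0, 360.0, n_bins)}>"

print(position_to_tokens([0.12, -0.4, 0.9]), angle_to_token(135.0))
```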
To use assistive robots in everyday life, a remote control system built on common devices, such as 2D devices, helps users control the robots anytime and anywhere as intended. Hand-drawn sketches are one of the most intuitive ways to control robots with 2D devices. However, since similar sketches carry different intentions from scene to scene, existing work needs additional modalities to set the sketches' semantics, which requires complex operations from users and reduces usability. In this paper, we propose Sketch-MoMa, a teleoperation system that uses user-given hand-drawn sketches as instructions to control a robot. We use Vision-Language Models (VLMs) to understand the user-given sketches superimposed on an observation image and to infer the drawn shapes and the robot's low-level tasks. We utilize the sketches and the generated shapes for recognition and motion planning of the generated low-level tasks, enabling precise and intuitive operations. We validate our approach using state-of-the-art VLMs with 7 tasks and 5 sketch shapes. We also demonstrate that our approach effectively specifies detailed motions, such as how to grasp and how much to rotate. Moreover, we show the competitive usability of our approach compared with an existing 2D interface through a user experiment with 14 participants.
https://arxiv.org/abs/2412.19153
Missing values are a critical issue in data science, significantly impacting the reliability of analyses and predictions. Missing value imputation (MVI) is a longstanding problem because it relies heavily on domain knowledge. Large language models (LLMs) have emerged as a promising tool for data cleaning, including MVI for tabular data, offering advanced capabilities for understanding and generating content. However, despite their promise, existing LLM techniques such as in-context learning and Chain-of-Thought (CoT) often fall short in guiding LLMs to perform the complex reasoning MVI requires, particularly when imputing derived missing values, which call for mathematical formulas and data relationships across rows and columns. This gap underscores the need for further advancements in LLM methodologies to enhance their reasoning capabilities for more reliable imputation outcomes. To fill this gap, we propose SketchFill, a novel sketch-based method to guide LLMs in generating accurate formulas to impute missing numerical values. Our experimental results demonstrate that SketchFill significantly outperforms state-of-the-art methods, achieving 56.2% higher accuracy than CoT-based methods and 78.8% higher accuracy than MetaGPT. This sets a new standard for automated data cleaning and advances the field of MVI for numerical values.
https://arxiv.org/abs/2412.19113
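A minimal, hypothetical sketch of formula-based imputation in this spirit is shown below: an LLM is asked to complete a formula sketch, and the returned expression is applied only to rows with missing values. The prompt wording and the `call_llm` client are stand-ins, not SketchFill's actual components.

```python
# Assumed illustration of formula-based numerical imputation; `call_llm` is a
# placeholder for any LLM client, and the prompt is purely illustrative.
import pandas as pd

def impute_with_formula(df: pd.DataFrame, target: str, call_llm) -> pd.DataFrame:
    sketch = (
        f"Columns: {list(df.columns)}\n"
        f"Fill the sketch so it computes the missing column '{target}' from the "
        f"other columns. Sketch: {target} = <formula over other columns>.\n"
        "Answer with a single pandas-evaluable expression, e.g. 'total / quantity'."
    )
    formula = call_llm(sketch).strip()          # e.g. "total / quantity"
    missing = df[target].isna()                 # only fill the derived missing rows
    df.loc[missing, target] = df.loc[missing].eval(formula)
    return df
```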
Existing facial editing methods have achieved remarkable results, yet they often fall short in supporting multimodal conditional local facial editing. One telling symptom is that their output image quality degrades dramatically after several iterations of incremental editing, because they do not support local editing. In this paper, we present FACEMUG, a novel multimodal generative and fusion framework for globally-consistent local facial editing that can handle a wide range of input modalities and enable fine-grained, semantic manipulation while leaving unedited parts unchanged. Different modalities, including sketches, semantic maps, color maps, exemplar images, text, and attribute labels, are adept at conveying diverse conditioning details, and their combined synergy can provide more explicit guidance for the editing process. We thus integrate all modalities into a unified generative latent space to enable multimodal local facial edits. Specifically, a novel multimodal feature fusion mechanism is proposed that utilizes multimodal aggregation and style fusion blocks to fuse facial priors and multimodalities in both latent and feature spaces. We further introduce a novel self-supervised latent warping algorithm to rectify misaligned facial features, efficiently transferring the pose of the edited image to the given latent codes. We evaluate FACEMUG through extensive experiments and comparisons to state-of-the-art (SOTA) methods. The results demonstrate the superiority of FACEMUG in terms of editing quality, flexibility, and semantic control, making it a promising solution for a wide range of local facial editing tasks.
https://arxiv.org/abs/2412.19009
Humans naturally rely on floor plans to navigate in unfamiliar environments, as they are readily available, reliable, and provide rich geometrical guidance. However, existing visual navigation settings overlook this valuable prior knowledge, leading to limited efficiency and accuracy. To close this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plans into embodied visual navigation. While the floor plan offers significant advantages, two key challenges emerge: (1) handling the spatial inconsistency between the floor plan and the actual scene layout for collision-free navigation, and (2) aligning observed images with the floor plan sketch despite their distinct modalities. To address these challenges, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate alignment between the current observation and the floor plan. We further collect $20k$ navigation episodes across $117$ scenes in the iGibson simulator to support training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our framework in unfamiliar scenes using floor plan knowledge. Project website: this https URL.
https://arxiv.org/abs/2412.18335
We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions. Unlike previous methods that struggle in real-world scenarios or lack interactive editing capabilities, ChatGarment can estimate sewing patterns from in-the-wild images or sketches, generate them from text descriptions, and edit garments based on user instructions, all within an interactive dialogue. These sewing patterns can then be draped into 3D garments, which are easily animatable and simulatable. This is achieved by finetuning a VLM to directly generate a JSON file that includes both textual descriptions of garment types and styles, as well as continuous numerical attributes. This JSON file is then used to create sewing patterns through a programming parametric model. To support this, we refine the existing programming model, GarmentCode, by expanding its garment type coverage and simplifying its structure for efficient VLM fine-tuning. Additionally, we construct a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs through an automated data pipeline. Extensive evaluations demonstrate ChatGarment's ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications. Code and data will be available at this https URL.
https://arxiv.org/abs/2412.17811
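The JSON intermediate is the interesting interface here. Below is a purely illustrative example of such a file, with textual type/style fields plus continuous numerical attributes, and a helper that flattens it into parameters for a GarmentCode-like parametric model; the field names are assumptions, not ChatGarment's real schema.

```python
# Hypothetical example of a VLM-emitted garment spec; field names are assumed.
import json

garment_json = json.loads("""
{
  "garment_type": "dress",
  "style": {"sleeve": "puff", "neckline": "v-neck", "skirt": "a-line"},
  "numeric_attributes": {"length_cm": 95.0, "waist_cm": 70.0, "sleeve_length_cm": 24.5}
}
""")

def to_pattern_params(spec):
    """Flatten the VLM output into keyword arguments for a parametric pattern model."""
    params = {"garment_type": spec["garment_type"], **spec["style"]}
    params.update(spec["numeric_attributes"])
    return params

print(to_pattern_params(garment_json))
```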
Test-time adaptation (TTA) aims to fine-tune a trained model online using unlabeled testing data to adapt to new environments or out-of-distribution data, demonstrating broad application potential in real-world scenarios. However, in this optimization process, unsupervised learning objectives like entropy minimization frequently encounter noisy learning signals. These signals produce unreliable gradients, which hinder the model's ability to converge to an optimal solution quickly and introduce significant instability into the optimization process. In this paper, we seek to resolve these issues from the perspective of optimizer design. Unlike prior TTA methods that use manually designed optimizers such as SGD, we employ a learning-to-optimize approach to automatically learn an optimizer, called the Meta Gradient Generator (MGG). Specifically, we aim for MGG to effectively utilize historical gradient information during the online optimization process to optimize the current model. To this end, in MGG, we design a lightweight and efficient sequence modeling layer, the gradient memory layer. It exploits a self-supervised reconstruction loss to compress historical gradient information into network parameters, thereby enabling better memorization over a long-term adaptation process. We only need a small number of unlabeled samples to pre-train MGG, and the trained MGG can then be deployed to process unseen samples. Promising results on ImageNet-C, R, Sketch, and A indicate that our method surpasses current state-of-the-art methods with fewer updates, less data, and significantly shorter adaptation iterations. Compared with the previous SOTA method SAR, we achieve a 7.4% accuracy improvement and 4.2 times faster adaptation on ImageNet-C.
https://arxiv.org/abs/2412.16901
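As a rough picture of what a learned optimizer with a gradient memory might look like, the sketch below uses a small recurrent cell to summarize per-parameter gradient history and emit updates; the architecture and sizes are assumptions for illustration, not MGG's actual design.

```python
# Assumed sketch of a learned-optimizer update driven by a gradient memory.
import torch
import torch.nn as nn

class GradientMemoryOptimizer(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.cell = nn.GRUCell(1, hidden)   # compresses per-parameter gradient history
        self.head = nn.Linear(hidden, 1)    # maps memory state to a parameter update

    def step(self, params, grads, state):
        """params, grads: tensors with P elements; state: (P, hidden) memory."""
        g = grads.reshape(-1, 1)                      # (P, 1) flattened gradients
        state = self.cell(g, state)                   # update gradient memory
        update = self.head(state).view_as(params)     # learned update per parameter
        return params - update, state
```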
Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these systems struggle to efficiently process and retrieve information from large datasets while maintaining a comprehensive understanding of the context. This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs, thereby merging structured and unstructured data for a more holistic comprehension. SKETCH demonstrates substantial improvements in retrieval performance and maintains superior context integrity compared to traditional methods. Evaluated across four diverse datasets (QuALITY, QASPER, NarrativeQA, and Italian Cuisine), SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer_relevancy, faithfulness, context_precision, and context_recall. Notably, on the Italian Cuisine dataset, SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99, the highest performance across all evaluated metrics. These results highlight SKETCH's capability to deliver more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.
https://arxiv.org/abs/2412.15443
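A hedged sketch of how semantic retrieval and knowledge-graph facts might be merged into one context, in the spirit of the abstract above, could look as follows; `vector_store.search` and `kg.neighbours` are hypothetical interfaces, not the paper's code.

```python
# Assumed illustration of combining unstructured passages with structured triples.
def retrieve_context(question, vector_store, kg, k=5):
    chunks = vector_store.search(question, top_k=k)           # unstructured passages
    entities = {e for c in chunks for e in c.entities}         # entities mentioned in them
    triples = [t for e in entities for t in kg.neighbours(e)]  # structured facts
    facts = "\n".join(f"{s} --{r}--> {o}" for s, r, o in triples)
    passages = "\n\n".join(c.text for c in chunks)
    return f"Passages:\n{passages}\n\nGraph facts:\n{facts}"
```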
Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector representation of sewing patterns into a more comprehensive one that covers more intricate details, and then compress it into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, which significantly surpasses previous approaches in terms of complex garment design and adaptability to various body shapes. Our project page: this https URL.
https://arxiv.org/abs/2412.14453
The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc emerges as a video line art colorization tool, which automatically converts sketch sequences into colored animations following the reference character specification. Our model exploits correspondence matching as an explicit guidance, yielding strong robustness to the variations (e.g., posture) between the reference character and each line art frame. In addition, our model could even automate the in-betweening process, such that users can easily create a temporally consistent animation by simply providing a character image as well as the start and end sketches. Our code is available at: this https URL.
https://arxiv.org/abs/2412.14173
Computer-Aided Design (CAD) models are typically constructed by sequentially drawing parametric sketches and applying CAD operations to obtain a 3D model. The problem of 3D CAD reverse engineering consists of reconstructing the sketch and CAD operation sequences from 3D representations such as point clouds. In this paper, we address this challenge through novel contributions across three levels: CAD sequence representation, network design, and dataset. In particular, we represent CAD sketch-extrude sequences as Python code. The proposed CAD-Recode translates a point cloud into Python code that, when executed, reconstructs the CAD model. Taking advantage of the exposure of pre-trained Large Language Models (LLMs) to Python code, we leverage a relatively small LLM as a decoder for CAD-Recode and combine it with a lightweight point cloud projector. CAD-Recode is trained solely on a proposed synthetic dataset of one million diverse CAD sequences. CAD-Recode significantly outperforms existing methods across three datasets while requiring fewer input points. Notably, it achieves 10 times lower mean Chamfer distance than state-of-the-art methods on DeepCAD and Fusion360 datasets. Furthermore, we show that our CAD Python code output is interpretable by off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering from point clouds.
https://arxiv.org/abs/2412.14042
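One plausible form of such executable sketch-extrude code, written against the open-source CadQuery API purely for illustration (the abstract does not state which library or code style CAD-Recode emits), is shown below.

```python
# Illustrative sketch-extrude sequence expressed as Python; the use of CadQuery
# here is an assumption for demonstration, not necessarily CAD-Recode's output format.
import cadquery as cq

model = (cq.Workplane("XY")
         .rect(40.0, 30.0)          # 2D sketch: rectangle on the XY plane
         .extrude(5.0)              # CAD operation: extrude into a plate
         .faces(">Z").workplane()   # new sketch plane on the top face
         .circle(6.0)               # 2D sketch: circle
         .cutThruAll())             # CAD operation: drill through the plate
```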
Scene generation is crucial to many computer graphics applications. Recent advances in generative AI have streamlined sketch-to-image workflows, easing the workload for artists and designers in creating scene concept art. However, these methods often struggle with complex scenes containing multiple detailed objects, sometimes missing small or uncommon instances. In this paper, after revisiting the entire cross-attention mechanism, we propose Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation. This scheme revitalizes the existing ControlNet model, enabling effective handling of multi-instance generation through prompt balance, characteristics prominence, and dense tuning. Specifically, the approach enhances keyword representation via the prompt balance module, reducing the risk of missing critical instances. It also includes a characteristics prominence module that highlights TopK indices in each channel, ensuring essential features are better represented based on token sketches. Additionally, it employs dense tuning to refine contour details in the attention map, compensating for instance-related regions. Experiments validate that our triplet tuning approach substantially improves the performance of existing sketch-to-image models. It consistently generates detailed, multi-instance 2D images, closely adhering to the input prompts and enhancing visual quality in complex multi-instance scenes. Code is available at this https URL.
https://arxiv.org/abs/2412.13486
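The characteristics prominence module lends itself to a compact sketch: keep and amplify only the TopK responses in each channel of the attention map so instance-specific features stand out. The snippet below is an assumption-laden illustration, not the released T3-S2S code.

```python
# Assumed illustration of TopK channel-wise prominence on cross-attention responses.
import torch

def topk_prominence(attn, k=16, boost=1.5):
    """attn: (B, C, N) cross-attention responses per channel over N spatial tokens.
    Amplifies the top-k entries of every channel and leaves the rest unchanged."""
    vals, idx = attn.topk(k, dim=-1)
    out = attn.clone()
    out.scatter_(-1, idx, vals * boost)
    return out
```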
This paper explores the integration of Visual Code Assistants in Integrated Development Environments (IDEs). In Software Engineering, whiteboard sketching is often the initial step before coding, serving as a crucial collaboration tool for developers. Previous studies have investigated patterns in SE sketches and how they are used in practice, yet methods for directly using these sketches for code generation remain limited. The emergence of visually-equipped large language models presents an opportunity to bridge this gap, which is the focus of our research. In this paper, we build a first prototype of a Visual Code Assistant to get user feedback regarding in-IDE sketch-to-code tools. We conduct an experiment with 19 data scientists, most of whom regularly sketch as part of their job. We investigate developers' mental models by analyzing patterns commonly observed in their sketches when developing an ML workflow. Analysis indicates that diagrams were the preferred organizational component (52.6%), often accompanied by lists (42.1%) and numbered points (36.8%). Our tool converts their sketches into a Python notebook by querying an LLM. We use an LLM-as-judge setup to score the quality of the generated code, finding that even brief sketching can effectively generate useful code outlines. We also find a positive correlation between sketch time and the quality of the generated code. We conclude the study by conducting extensive interviews to assess the tool's usefulness, explore potential use cases, and understand developers' needs. As noted by participants, promising applications for these assistants include education, prototyping, and collaborative settings. Our findings signal promise for the next generation of Code Assistants to integrate visual information, both to improve code generation and to better leverage developers' existing sketching practices.
https://arxiv.org/abs/2412.13386
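For the LLM-as-judge scoring mentioned above, a minimal hypothetical setup might look like the following; the rubric, scale, and `call_llm` client are assumptions rather than the authors' exact protocol.

```python
# Assumed illustration of an LLM-as-judge rubric for sketch-generated notebooks.
JUDGE_PROMPT = """You are reviewing a Python notebook generated from a whiteboard
sketch of an ML workflow. Rate it from 1 (unusable) to 5 (ready to run) on:
correctness, completeness of the workflow steps, and readability.
Return only three integers separated by spaces.

Sketch description:
{sketch}

Generated notebook code:
{code}
"""

def judge_notebook(sketch_text, notebook_code, call_llm):
    reply = call_llm(JUDGE_PROMPT.format(sketch=sketch_text, code=notebook_code))
    return [int(x) for x in reply.split()[:3]]  # [correctness, completeness, readability]
```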
Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works have attempted to incorporate multi-faceted controls (text and sketch), aiming to enhance creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above, or no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.
https://arxiv.org/abs/2412.11594
The use of Unmanned Aerial Vehicles (UAVs) for aerial tasks and environmental manipulation is increasingly desired, and art tasks offer a clear way to demonstrate such capabilities. This paper presents the development of Magnasketch, a system capable of translating image inputs into art on a magnetic drawing board via a Bitcraze Crazyflie 2.0 quadrotor. Optimal trajectories were generated using a Model Predictive Control (MPC) formulation that newly incorporates magnetic force dynamics. A Z-compliant magnetic drawing apparatus was designed for the quadrotor. Experimental results comparing the novel controller against the existing Position High Level Commander showed comparable performance. Although slightly outperformed in terms of error, with average errors of 3.9 cm, 4.4 cm, and 0.5 cm in x, y, and z respectively, the Magnasketch controller produced smoother drawings with the added benefit of full state control.
https://arxiv.org/abs/2412.10670
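To give a flavor of how magnetic force dynamics might enter an MPC formulation, here is a toy double-integrator example in CVXPY with a constant downward magnetic pull in the z-dynamics; the model, weights, and magnet term are simplified assumptions, not the Magnasketch controller.

```python
# Toy MPC step with an assumed constant magnetic pull entering the z-axis dynamics.
import numpy as np
import cvxpy as cp

dt, N, m, f_mag = 0.05, 20, 0.03, 0.05           # step, horizon, mass (kg), magnet pull (N)
# state x = [px, py, pz, vx, vy, vz], input u = [ax, ay, az] (commanded accelerations)
A = np.eye(6); A[:3, 3:] = dt * np.eye(3)
B = np.zeros((6, 3)); B[3:, :] = dt * np.eye(3)
d = np.zeros(6); d[5] = -dt * f_mag / m          # magnetic force acting on vz

def mpc_step(x0, x_ref):
    x = cp.Variable((6, N + 1)); u = cp.Variable((3, N))
    cost, cons = 0, [x[:, 0] == x0]
    for k in range(N):
        cost += cp.sum_squares(x[:, k] - x_ref) + 0.1 * cp.sum_squares(u[:, k])
        cons += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k] + d,
                 cp.norm(u[:, k], "inf") <= 2.0]
    cp.Problem(cp.Minimize(cost), cons).solve()
    return u.value[:, 0]                          # apply the first input, then re-plan
```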
We present MS2Mesh-XR, a novel multi-modal sketch-to-mesh generation pipeline that enables users to create realistic 3D objects in extended reality (XR) environments using hand-drawn sketches assisted by voice inputs. Specifically, users can intuitively sketch objects using natural hand movements in mid-air within a virtual environment. By integrating voice inputs, we employ ControlNet to infer realistic images based on the drawn sketches and interpreted text prompts. Users can then review and select their preferred image, which is subsequently reconstructed into a detailed 3D mesh using the Convolutional Reconstruction Model. Notably, our proposed pipeline can generate a high-quality 3D mesh in less than 20 seconds, allowing for immersive visualization and manipulation in run-time XR scenes. We demonstrate the practicality of our pipeline through two use cases in XR settings. By leveraging natural user inputs and cutting-edge generative AI capabilities, our approach can significantly facilitate XR-based creative production and enhance user experiences. Our code and demo will be available at: this https URL
https://arxiv.org/abs/2412.09008