Generating human portraits is a hot topic in image generation, e.g., mask-to-face and text-to-face generation. However, these unimodal generation methods lack controllability in image generation. Controllability can be enhanced by exploiting the advantages and complementarities of different modalities: for instance, text excels at controlling diverse attributes, while masks excel at controlling spatial locations. Current state-of-the-art methods in multimodal generation face limitations due to their reliance on extensive hyperparameters, manual operations during the inference stage, substantial computational demands during training and inference, or inability to edit real images. In this paper, we propose a practical framework - MM2Latent - for multimodal image generation and editing. We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities like mask, sketch and 3DMM. We propose a strategy that involves training a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations in the inference stage, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. It also proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods. We make the code publicly available at: this https URL
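A minimal sketch of the mapping idea, assuming a 512-d FaRL text embedding, a 512-d spatial-modality autoencoder latent, and StyleGAN2's 18 x 512 w codes; module names and sizes below are illustrative, not the paper's exact architecture:

```python
# Illustrative mapping network: fuse multimodal embeddings into StyleGAN w codes.
import torch
import torch.nn as nn

class MultimodalMappingNetwork(nn.Module):
    def __init__(self, text_dim=512, spatial_dim=512, w_dim=512, num_ws=18):
        super().__init__()
        self.num_ws, self.w_dim = num_ws, w_dim
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + spatial_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, num_ws * w_dim),
        )

    def forward(self, text_emb, spatial_latent):
        # Concatenate modality embeddings and regress all w vectors at once.
        h = torch.cat([text_emb, spatial_latent], dim=-1)
        return self.fuse(h).view(-1, self.num_ws, self.w_dim)

# Usage sketch: w = mapper(farl_text_emb, mask_ae_latent); img = stylegan2.synthesis(w)
```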
https://arxiv.org/abs/2409.11010
This paper introduces the notion of a universal plan, which, when executed, is guaranteed to solve all planning problems in a category, regardless of the obstacles, initial state, and goal set. Such plans are specified as a deterministic sequence of actions that are blindly applied without any sensor feedback. Thus, they can be considered pure exploration in a reinforcement learning context, and we show that with basic memory requirements, they even yield optimal plans. Building upon results in number theory and the theory of automata, we provide universal plans both for discrete and continuous (motion) planning and prove their (semi)completeness. The concepts are applied and illustrated through simulation studies, and several directions for future research are sketched.
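As a toy illustration only, not the paper's number-theoretic construction: a blind agent executes the concatenation of all finite action strings in length-lexicographic order, with no sensing; the simulator, not the agent, checks whether the goal is ever visited:

```python
# Toy "universal" action sequence on a grid: enumerate every finite action
# string and execute them blindly (bumping into walls leaves the state fixed).
from itertools import count, product

ACTIONS = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def universal_sequence():
    for n in count(1):                        # lengths 1, 2, 3, ...
        for word in product(ACTIONS, repeat=n):
            yield from word

def blind_execute(start, goal, blocked, width, height, max_steps=100_000):
    x, y = start
    for step, a in zip(range(max_steps), universal_sequence()):
        dx, dy = ACTIONS[a]
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in blocked:
            x, y = nx, ny                     # move; otherwise bump and stay
        if (x, y) == goal:                    # checked by the simulator only
            return step + 1
    return None

print(blind_execute((0, 0), (2, 2), {(1, 1)}, 3, 3))
```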
https://arxiv.org/abs/2407.02090
Human-level autonomous driving is an ever-elusive goal, with planning and decision making -- the cognitive functions that determine driving behavior -- posing the greatest challenge. Despite a proliferation of promising approaches, progress is stifled by the difficulty of deploying experimental planners in naturalistic settings. In this work, we propose Lab2Car, an optimization-based wrapper that can take a trajectory sketch from an arbitrary motion planner and convert it to a safe, comfortable, dynamically feasible trajectory that the car can follow. This allows motion planners that do not provide such guarantees to be safely tested and optimized in real-world environments. We demonstrate the versatility of Lab2Car by using it to deploy a machine learning (ML) planner and a search-based planner on self-driving cars in Las Vegas. The resulting systems handle challenging scenarios, such as cut-ins, overtaking, and yielding, in complex urban environments like casino pick-up/drop-off areas. Our work paves the way for quickly deploying and evaluating candidate motion planners in realistic settings, ensuring rapid iteration and accelerating progress towards human-level autonomy.
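A hedged sketch of the wrapper idea: project a planner's waypoint sketch onto a smooth trajectory with a soft acceleration bound as a stand-in for comfort and dynamic feasibility. The cost weights, limits, and solver choice below are illustrative, not Lab2Car's actual formulation:

```python
# Optimization-based wrapper: stay near the sketch, penalize rough/infeasible motion.
import numpy as np
from scipy.optimize import minimize

def wrap_trajectory(sketch, dt=0.1, a_max=3.0, w_track=1.0, w_smooth=10.0):
    sketch = np.asarray(sketch, dtype=float)   # (T, 2) waypoints from any planner
    T = len(sketch)

    def cost(flat):
        traj = flat.reshape(T, 2)
        accel = np.diff(traj, n=2, axis=0) / dt**2
        track = w_track * np.sum((traj - sketch) ** 2)      # follow the sketch
        smooth = w_smooth * np.sum(accel ** 2)              # comfort proxy
        overshoot = np.maximum(np.linalg.norm(accel, axis=1) - a_max, 0.0)
        return track + smooth + 1e3 * np.sum(overshoot ** 2)  # soft feasibility

    res = minimize(cost, sketch.ravel(), method="L-BFGS-B")
    return res.x.reshape(T, 2)

safe = wrap_trajectory([[0, 0], [1, 0.2], [2, 1.5], [3, 1.4], [4, 2.0]])
```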
https://arxiv.org/abs/2409.09523
Large language models (LLMs) represented by the GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of "plug-and-play" for various applications. The components of Sketch will be progressively open-sourced at this https URL.
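A hypothetical example of what a task description schema plus a structured-output prompt template might look like; the field names are illustrative stand-ins, not Sketch's actual schema format:

```python
# Build a structured-output prompt from a task schema (illustrative only).
import json

ner_schema = {
    "task": "named_entity_recognition",
    "output_format": {
        "type": "array",
        "items": {"entity": "string", "label": ["PER", "ORG", "LOC", "MISC"]},
    },
}

PROMPT_TEMPLATE = """You are an information extraction assistant.
Task: {task}
Return ONLY valid JSON matching this schema:
{schema}
Input: {text}
"""

prompt = PROMPT_TEMPLATE.format(
    task=ner_schema["task"],
    schema=json.dumps(ner_schema["output_format"], indent=2),
    text="Tim Cook announced Apple's new campus in Austin.",
)
# The model's JSON reply can then be validated against the schema before use.
```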
https://arxiv.org/abs/2409.03346
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards from the past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on $L_\infty^\mathsf{p}$-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS), instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated with each of these techniques, and we validate the approach experimentally.
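For reference, a standard Count-Min-Sketch, the kind of structure adopted here in place of naive per-history counting; the width/depth values are illustrative:

```python
# Count-Min-Sketch: sublinear-memory frequency counting with one-sided error.
import random

class CountMinSketch:
    def __init__(self, width=2048, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _index(self, item, row):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += count

    def estimate(self, item):
        # Never underestimates; error shrinks with width, confidence with depth.
        return min(self.tables[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
cms.add(("obs_a", "act_1", "obs_b"))   # e.g., a history fragment in an RDP
print(cms.estimate(("obs_a", "act_1", "obs_b")))
```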
https://arxiv.org/abs/2409.02747
Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate context, particularly when working with definitions that are neither in the training data nor near the cursor. This paper demonstrates that tight integration with the type and binding structure of a language, as exposed by its language server, can address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server identifies the type and typing context of the hole being filled, even in the presence of errors, ensuring that a meaningful program sketch is always available. This allows prompting with codebase-wide contextual information not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer's goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications. These applications serve as challenge problems due to their reliance on application-specific data structures. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel, we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.
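A hedged sketch of the prompting pattern: ask a language server for the expected type at the hole and for semantically relevant definitions, then assemble the prompt. `expected_type_at` and `relevant_definitions` are hypothetical stand-ins for ChatLSP-style capabilities, not actual LSP methods:

```python
# Assemble an LLM prompt from language-server-provided static context.
def build_completion_prompt(server, file, hole_pos, sketch_text):
    hole_type = server.expected_type_at(file, hole_pos)       # hypothetical capability
    defs = server.relevant_definitions(file, hole_pos, k=5)   # hypothetical capability

    context = "\n".join(d.source for d in defs)
    return (
        "Fill the hole so the program type-checks.\n"
        f"Relevant definitions:\n{context}\n"
        f"Expected type of the hole: {hole_type}\n"
        f"Program sketch:\n{sketch_text}\n"
        "Completion:"
    )

# A completion can then be re-checked by the language server and, on error,
# the diagnostics fed back into a follow-up prompt for iterative refinement.
```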
https://arxiv.org/abs/2409.00921
Portrait sketching involves capturing identity specific attributes of a real face with abstract lines and shades. Unlike photo-realistic images, a good portrait sketch generation method needs selective attention to detail, making the problem challenging. This paper introduces \textbf{Portrait Sketching StyleGAN (PS-StyleGAN)}, a style transfer approach tailored for portrait sketch synthesis. We leverage the semantic $W+$ latent space of StyleGAN to generate portrait sketches, allowing us to make meaningful edits, like pose and expression alterations, without compromising identity. To achieve this, we propose the use of Attentive Affine transform blocks in our architecture, and a training strategy that allows us to change StyleGAN's output without finetuning it. These blocks learn to modify style latent code by paying attention to both content and style latent features, allowing us to adapt the outputs of StyleGAN in an inversion-consistent manner. Our approach uses only a few paired examples ($\sim 100$) to model a style and has a short training time. We demonstrate PS-StyleGAN's superiority over the current state-of-the-art methods on various datasets, qualitatively and quantitatively.
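An illustrative PyTorch sketch of an attentive affine transform block: predict a scale and shift for the style code by attending over content and style features. The dimensions and head count are assumptions for clarity, not the paper's exact design:

```python
# Attentive affine block: attention-conditioned affine edit of W+ style codes.
import torch
import torch.nn as nn

class AttentiveAffine(nn.Module):
    def __init__(self, w_dim=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(w_dim, num_heads, batch_first=True)
        self.to_scale = nn.Linear(w_dim, w_dim)
        self.to_shift = nn.Linear(w_dim, w_dim)

    def forward(self, w, content_feats, style_feats):
        # w: (B, N, 512) latent codes; features: (B, L, 512) token sequences.
        kv = torch.cat([content_feats, style_feats], dim=1)
        ctx, _ = self.attn(query=w, key=kv, value=kv)
        return w * (1 + self.to_scale(ctx)) + self.to_shift(ctx)  # affine edit

# Applied to W+ codes before synthesis, so StyleGAN itself stays frozen and
# inversion remains consistent.
```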
https://arxiv.org/abs/2409.00345
Based on recent advanced diffusion models, Text-to-image (T2I) generation models have demonstrated their capabilities in generating diverse and high-quality images. However, leveraging their potential for real-world content creation, particularly in providing users with precise control over the image generation result, poses a significant challenge. In this paper, we propose an innovative training-free pipeline that extends existing text-to-image generation models to incorporate a sketch as an additional condition. To generate new images with a layout and structure closely resembling the input sketch, we find that these core features of a sketch can be tracked with the cross-attention maps of diffusion models. We introduce latent optimization, a method that refines the noisy latent at each intermediate step of the generation process using cross-attention maps to ensure that the generated images closely adhere to the desired structure outlined in the reference sketch. Through latent optimization, our method enhances the fidelity and accuracy of image generation, offering users greater control and customization options in content creation.
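A hedged sketch of one latent-optimization step: nudge the noisy latent so the denoiser's cross-attention maps concentrate inside the sketch's region. The attention-extraction hook and the loss form are illustrative simplifications:

```python
# One latent-optimization step guided by cross-attention maps.
import torch

def latent_optimization_step(latent, t, unet, text_emb, sketch_mask, lr=0.05):
    latent = latent.detach().requires_grad_(True)
    # Assumed helper returning averaged cross-attention maps, (B, H, W) in [0, 1].
    attn = unet.cross_attention_maps(latent, t, text_emb)
    # Penalize attention mass falling outside the reference sketch region.
    loss = (attn * (1 - sketch_mask)).mean() - (attn * sketch_mask).mean()
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad
    return latent.detach()

# Run between denoising steps (typically only the early, structure-forming
# ones), then continue ordinary sampling from the adjusted latent.
```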
https://arxiv.org/abs/2409.00313
Automating architectural floorplan design is vital for housing and interior design, offering a faster, cost-effective alternative to manual sketches by architects. However, existing methods, including rule-based and learning-based approaches, face challenges with design complexity and constrained generation, often require extensive post-processing, and tend to produce obvious geometric inconsistencies such as misalignment, overlap, and gaps. In this work, we propose a novel generative framework for vector floorplan design via structural graph generation, called GSDiff, focusing on wall junction generation and wall segment prediction to capture both geometric and semantic aspects of structural graphs. To improve the geometric rationality of generated structural graphs, we propose two innovative geometry enhancement methods. In wall junction generation, we propose a novel alignment loss function to improve geometric consistency. In wall segment prediction, we propose a random self-supervision method to enhance the model's perception of the overall geometric structure, thereby promoting the generation of reasonable geometric structures. Employing the diffusion model and the Transformer model, as well as the geometry enhancement strategies, our framework can generate wall junctions, wall segments and room polygons with structural and semantic information, resulting in structural graphs that accurately represent floorplans. Extensive experiments show that the proposed method surpasses existing techniques, enabling free generation and constrained generation, marking a shift towards structure generation in architectural design.
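An illustrative alignment loss in the spirit of the idea: softly pull nearly aligned wall junctions toward exact axis-alignment. This is a simplification for intuition, not GSDiff's exact loss:

```python
# Differentiable alignment penalty over predicted wall-junction coordinates.
import torch

def alignment_loss(junctions, tau=0.05):
    # junctions: (N, 2) predicted junction coordinates in [0, 1].
    diff = junctions[:, None, :] - junctions[None, :, :]   # (N, N, 2) pairwise
    absdiff = diff.abs()
    # A pair is "nearly aligned" on an axis if that coordinate gap is < tau.
    near = (absdiff < tau).float()
    eye = torch.eye(len(junctions), device=junctions.device).unsqueeze(-1)
    near = near * (1 - eye)                                # ignore self-pairs
    return (near * absdiff).sum() / near.sum().clamp(min=1)

walls = torch.rand(12, 2, requires_grad=True)
loss = alignment_loss(walls)   # drives near-aligned junction pairs together
loss.backward()
```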
https://arxiv.org/abs/2408.16258
Diffusion models have emerged as a popular method for 3D generation. However, it is still challenging for diffusion models to efficiently generate diverse and high-quality 3D shapes. In this paper, we introduce OctFusion, which can generate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia 4090 GPU, and the extracted meshes are guaranteed to be continuous and manifold. The key components of OctFusion are the octree-based latent representation and the accompanying diffusion models. The representation combines the benefits of both implicit neural representations and explicit spatial octrees and is learned with an octree-based variational autoencoder. The proposed diffusion model is a unified multi-scale U-Net that enables weights and computation sharing across different octree levels and avoids the complexity of widely used cascaded diffusion schemes. We verify the effectiveness of OctFusion on the ShapeNet and Objaverse datasets and achieve state-of-the-art performances on shape generation tasks. We demonstrate that OctFusion is extendable and flexible by generating high-quality color fields for textured mesh generation and high-quality 3D shapes conditioned on text prompts, sketches, or category labels. Our code and pre-trained models are available at \url{this https URL}.
https://arxiv.org/abs/2408.14732
Volumetric segmentation is crucial for medical imaging but is often constrained by labor-intensive manual annotations and the need for scenario-specific model training. Furthermore, existing general segmentation models are inefficient due to their design and inferential approaches. Addressing this clinical demand, we introduce PropSAM, a propagation-based segmentation model that optimizes the use of 3D medical structure information. PropSAM integrates a CNN-based UNet for intra-slice processing with a Transformer-based module for inter-slice propagation, focusing on structural and semantic continuities to enhance segmentation across various modalities. Distinctively, PropSAM operates on a one-view prompt, such as a 2D bounding box or sketch mask, unlike conventional models that require two-view prompts. It has demonstrated superior performance, significantly improving the Dice Similarity Coefficient (DSC) across 44 medical datasets and various imaging modalities, outperforming models like MedSAM and SegVol with an average DSC improvement of 18.1%. PropSAM also maintains stable predictions despite prompt deviations and varying propagation configurations, confirmed by one-way ANOVA tests with P>0.5985 and P>0.6131, respectively. Moreover, PropSAM's efficient architecture enables faster inference speeds (Wilcoxon rank-sum test, P<0.001) and reduces user interaction time by 37.8% compared to two-view prompt models. Its ability to handle irregular and complex objects with robust performance further demonstrates its potential in clinical settings, facilitating more automated and reliable medical imaging analyses with minimal retraining.
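A hedged sketch of the propagation pattern: segment the prompted slice with a 2D model, then pass each predicted mask to the neighboring slice as guidance, in both directions. `segment_slice` and `propagate` are stand-ins for PropSAM's UNet and Transformer modules:

```python
# One-view prompt -> whole-volume segmentation via slice-to-slice propagation.
import numpy as np

def propagate_volume(volume, prompt_slice, prompt_box, segment_slice, propagate):
    D = volume.shape[0]
    masks = [None] * D
    masks[prompt_slice] = segment_slice(volume[prompt_slice], box=prompt_box)

    for z in range(prompt_slice + 1, D):          # propagate upward
        masks[z] = propagate(volume[z], prev_mask=masks[z - 1])
    for z in range(prompt_slice - 1, -1, -1):     # propagate downward
        masks[z] = propagate(volume[z], prev_mask=masks[z + 1])
    return np.stack(masks)

# A single 2D box or sketch on one slice seeds the whole volume, which is
# what removes the need for two-view prompts.
```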
https://arxiv.org/abs/2408.13836
Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation has serious limitations, as videos contain much richer content than images, especially in terms of motion. This information can hardly be adequately described with plain text. Fortunately, in computer vision, various visual representations can serve as additional control signals to guide generation. With the help of these signals, video generation can be controlled in finer detail, allowing for greater flexibility for different applications. Integrating various controls, however, is nontrivial. In this paper, we propose a universal framework called EasyControl. By propagating and injecting condition features through condition adapters, our method enables users to control video generation with a single condition map. With our framework, various conditions including raw pixels, depth, HED, etc., can be integrated into different Unet-based pre-trained video diffusion models at a low practical cost. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. EasyControl significantly improves various evaluation metrics across multiple validation datasets compared to previous works. Specifically, for the sketch-to-video generation task, EasyControl achieves an improvement of 152.0 on FVD and 19.9 on IS, respectively, in UCF101 compared with VideoComposer. For fidelity, our model demonstrates powerful image retention ability, resulting in high FVD and IS in UCF101 and MSR-VTT compared to other image-to-video models.
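An illustrative condition adapter in the spirit of the framework: encode a single condition map (sketch, depth, HED, ...) into multi-scale residuals added to a frozen UNet's encoder features. The channel widths are assumptions:

```python
# Condition adapter producing per-resolution residual features for a frozen UNet.
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    def __init__(self, cond_channels=3, widths=(320, 640, 1280)):
        super().__init__()
        stages, c_in = [], cond_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(c_in, w, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(w, w, 3, padding=1),
            ))
            c_in = w
        self.stages = nn.ModuleList(stages)

    def forward(self, cond_map):
        feats, h = [], cond_map
        for stage in self.stages:
            h = stage(h)
            feats.append(h)          # one residual per UNet resolution level
        return feats

# During denoising, each feature is added to the matching UNet block:
# unet_feat[i] = unet_feat[i] + adapter_feats[i], leaving the UNet frozen.
```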
https://arxiv.org/abs/2408.13005
The facial sketch synthesis (FSS) model, capable of generating sketch portraits from given facial photographs, holds profound implications across multiple domains, encompassing cross-modal face recognition, entertainment, art, media, among others. However, the production of high-quality sketches remains a formidable task, primarily due to the challenges and flaws associated with three key factors: (1) the scarcity of artist-drawn data, (2) the constraints imposed by limited style types, and (3) the deficiencies of processing input information in existing models. To address these difficulties, we propose a lightweight end-to-end synthesis model that efficiently converts images to corresponding multi-stylized sketches, obviating the necessity for any supplementary inputs (e.g., 3D geometry). In this study, we overcome the issue of data insufficiency by incorporating semi-supervised learning into the training process. Additionally, we employ a feature extraction module and style embeddings to proficiently steer the generative transformer during the iterative prediction of masked image tokens, thus achieving a continuous stylized output that retains facial features accurately in sketches. The extensive experiments demonstrate that our method consistently outperforms previous algorithms across multiple benchmarks by a clear margin.
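A hedged sketch of iterative masked-token prediction steered by a style embedding, in the spirit of MaskGIT-style decoding; the reveal scheduler and the way style is injected are illustrative, not the paper's exact procedure:

```python
# Iterative masked-token decoding conditioned on photo features and a style embedding.
import torch

@torch.no_grad()
def generate_sketch_tokens(transformer, photo_feats, style_emb,
                           seq_len=256, steps=8, mask_id=0):
    tokens = torch.full((1, seq_len), mask_id)
    for s in range(steps):
        logits = transformer(tokens, photo_feats, style_emb)   # (1, L, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        still_masked = tokens.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0)
        # Reveal the most confident fraction of the remaining masked positions.
        k = max(1, int(still_masked.sum() * (s + 1) / steps))
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens   # decoded by a VQ decoder into the stylized sketch image
```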
https://arxiv.org/abs/2408.12400
This study introduces a dataset consisting of approximately 9,000 images of mechanical mechanisms and their corresponding descriptions, aimed at supporting research in mechanism design. The dataset consists of a diverse collection of 2D and 3D sketches, meticulously curated to ensure relevance and quality. We demonstrate the application of this dataset by fine-tuning two models: 1) Stable Diffusion (for generating new mechanical designs), and 2) BLIP-2 (for captioning these designs). While the results from Stable Diffusion show promise, particularly in generating coherent 3D sketches, the model struggles with 2D sketches and occasionally produces nonsensical outputs. These limitations underscore the need for further development, particularly in expanding the dataset and refining model architectures. Nonetheless, this work serves as a step towards leveraging generative AI in mechanical design, highlighting both the potential and current limitations of these approaches.
https://arxiv.org/abs/2409.03763
Soft robots can exhibit better performance in specific tasks compared to conventional robots, particularly in healthcare-related tasks. However, the field of soft robotics is still young, and designing them often involves mimicking natural organisms or relying heavily on human experts' creativity. A formal automated design process is required. We propose the use of neuroevolution-based algorithms to automatically design initial sketches of soft actuators that can enable the movement of future medical devices, such as drug-delivering catheters. The actuator morphologies discovered by algorithms like Age-Fitness Pareto Optimization, NeuroEvolution of Augmenting Topologies (NEAT), and Hypercube-based NEAT (HyperNEAT) were compared based on the maximum displacement reached and their robustness against various control methods. Analysis of the results showed that neuroevolution-based algorithms produce better-performing and more robust actuators under different control methods. Moreover, the best-performing morphologies were discovered by the NEAT algorithm. As future work, we propose using the morphologies discovered here as test beds to optimize specialized controllers, enabling more effective functionality towards the desired deflections of the suggested soft catheters.
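A minimal, library-agnostic sketch of the evolutionary loop, where fitness is the maximum displacement an actuator morphology reaches in simulation. The `simulate`, `random_morphology`, and `mutate` callables stand in for the actual NEAT/HyperNEAT/AFPO pipelines and the soft-body simulator:

```python
# Generic neuroevolution loop: evolve morphologies for maximum displacement.
import random

def evolve(simulate, random_morphology, mutate, pop_size=50, generations=100):
    population = [random_morphology() for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(simulate(m), m) for m in population]     # (displacement, morphology)
        scored.sort(key=lambda t: t[0], reverse=True)
        parents = [m for _, m in scored[: pop_size // 5]]   # truncation selection
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop_size - len(parents))]
    return scored[0]                                        # best (fitness, morphology)

# Robustness can then be assessed by re-simulating the best morphologies under
# each candidate control method and comparing displacement distributions.
```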
https://arxiv.org/abs/2408.09107
The inner alignment problem, which asserts whether an arbitrary artificial intelligence (AI) model satisfices a non-trivial alignment function of its outputs given its inputs, is undecidable. This is rigorously proved by Rice's theorem, which is also equivalent to a reduction to Turing's Halting Problem, whose proof sketch is presented in this work. Nevertheless, there is an enumerable set of provenly aligned AIs that are constructed from a finite set of provenly aligned operations. Therefore, we argue that the alignment should be a guaranteed property from the AI architecture rather than a characteristic imposed post-hoc on an arbitrary AI model. Furthermore, while the outer alignment problem is the definition of a judge function that captures human values and preferences, we propose that such a function must also impose a halting constraint that guarantees that the AI model always reaches a terminal state in finite execution steps. Our work presents examples and models that illustrate this constraint and the intricate challenges involved, advancing a compelling case for adopting an intrinsically hard-aligned approach to AI systems architectures that ensures halting.
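The reduction can be sketched as code. All names here are hypothetical; this is the standard Rice-style argument expressed in Python, showing why no total `is_aligned` decider can exist:

```python
# Proof sketch as code: deciding alignment would decide halting.
def build_model(program, x, misbehave):
    """A model that misbehaves on every query iff program(x) halts."""
    def model(query):
        program(x)                  # may loop forever; then model never answers
        return misbehave(query)     # reached only if program(x) halted
    return model

def decide_halting(program, x, is_aligned, misbehave):
    # program halts on x  <=>  the constructed model is misaligned.
    return not is_aligned(build_model(program, x, misbehave))

# Since halting is undecidable, no total `is_aligned` exists for any
# non-trivial behavioral property -- which is why the paper argues alignment
# must be guaranteed by construction from provenly aligned operations rather
# than verified post-hoc on arbitrary models.
```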
https://arxiv.org/abs/2408.08995
Sketch, a powerful artistic technique to capture essential visual information about real-world objects, is increasingly gaining attention in the image synthesis field. However, evaluating the quality of synthesized sketches presents unique unsolved challenges. Current evaluation methods for sketch synthesis are inadequate due to the lack of a unified benchmark dataset, over-reliance on classification accuracy for recognizability, and unfair evaluation of sketches with different levels of simplification. To address these issues, we introduce SketchRef, a benchmark dataset comprising 4 categories of reference photos--animals, human faces, human bodies, and common objects--alongside novel evaluation metrics. Considering that classification accuracy is insufficient to measure the structural consistency between a sketch and its reference photo, we propose the mean Object Keypoint Similarity (mOKS) metric, utilizing pose estimation to assess structure-level recognizability. To ensure fair evaluation of sketches with different simplification levels, we propose a recognizability calculation method constrained by simplicity. We also collect 8K responses from art enthusiasts, validating the effectiveness of our proposed evaluation methods. We hope this work can provide a comprehensive evaluation of sketch synthesis algorithms, thereby aligning their performance more closely with human understanding.
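For reference, the Object Keypoint Similarity that mOKS averages, following the standard COCO definition; how SketchRef weights instances is an assumption here:

```python
# OKS between keypoints predicted on a sketch and on its reference photo.
import numpy as np

def oks(pred, gt, visible, k, area):
    # pred, gt: (K, 2) keypoints; visible: (K,) bool; k: (K,) per-keypoint
    # constants; area: object scale s^2, per the COCO convention.
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    sim = np.exp(-d2 / (2 * area * k ** 2))
    return sim[visible].mean()

def mean_oks(sketch_kps, photo_kps, vis, k, areas):
    return np.mean([oks(p, g, v, k, a)
                    for p, g, v, a in zip(sketch_kps, photo_kps, vis, areas)])

# A pose estimator is run on both sketch and reference photo; high mOKS means
# the sketch preserved the structure (pose) of the reference.
```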
https://arxiv.org/abs/2408.08623
Attention based models have achieved many remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes the vanilla Attention based models hard to apply to long sequence tasks. Various improved Attention structures are proposed to reduce the computation cost by inducing low rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer sub-sequences used, the better information is preserved, but at the price of introducing more noise and computational costs. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S$^3$Attention, which significantly improves upon the previous attempts to negotiate this trade-off. S$^3$Attention has two mechanisms to effectively minimize the impact of noise while keeping the linear complexity to the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S$^3$Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting show that S$^3$Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.
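A hedged sketch of the two mechanisms: (1) a smoothing block mixing information along the sequence, and (2) skeleton sketching that selects a sub-sequence of keys/values to stand in for the full matrix, keeping cost linear in length. Uniform sampling and average pooling are simplifications (the paper selects both columns and rows; only sub-sequence selection is shown):

```python
# Smoothed skeleton-sketched attention: O(L * m) instead of O(L^2).
import torch
import torch.nn.functional as F

def smooth(x, kernel=9):
    # x: (B, L, D). Moving average along the sequence dimension.
    return F.avg_pool1d(x.transpose(1, 2), kernel, stride=1,
                        padding=kernel // 2).transpose(1, 2)

def s3_attention(q, k, v, m=64):
    # Select m "skeleton" positions of the smoothed k/v instead of all L.
    L = k.shape[1]
    idx = torch.linspace(0, L - 1, m).long()
    ks, vs = smooth(k)[:, idx], smooth(v)[:, idx]
    attn = torch.softmax(q @ ks.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ vs

q = k = v = torch.randn(2, 1024, 64)
out = s3_attention(q, k, v)   # (2, 1024, 64), linear in the sequence length
```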
https://arxiv.org/abs/2408.08567
Distributional drift detection is important in medical applications as it helps ensure the accuracy and reliability of models by identifying changes in the underlying data distribution that could affect diagnostic or treatment decisions. However, current methods have limitations in detecting drift; for example, the inclusion of abnormal datasets can lead to unfair comparisons. This paper presents an accurate and sensitive approach to detect distributional drift in CT-scan medical images by leveraging data-sketching and fine-tuning techniques. We developed a robust baseline library model for real-time anomaly detection, allowing for efficient comparison of incoming images and identification of anomalies. Additionally, we fine-tuned a vision transformer pre-trained model to extract relevant features using breast cancer images as an example, significantly enhancing model accuracy to 99.11%. Combining data-sketches with fine-tuning, our feature extraction evaluation demonstrated that cosine similarity scores between similar datasets improve substantially, from around 50% to 100%. Finally, the sensitivity evaluation shows that our solution is highly sensitive to even 1% salt-and-pepper and speckle noise, yet insensitive to lighting noise (i.e., lighting conditions have no impact on data drift). The proposed methods offer a scalable and reliable solution for maintaining the accuracy of diagnostic models in dynamic clinical environments.
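A hedged sketch of the detection loop: keep a baseline library of feature vectors from in-distribution scans, then flag an incoming image whose best cosine similarity against the library falls below a threshold. The encoder is any fine-tuned ViT; the threshold is illustrative:

```python
# Cosine-similarity drift check against a baseline feature library.
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

class DriftDetector:
    def __init__(self, encode, baseline_images, threshold=0.85):
        self.encode = encode                       # image -> feature vector
        self.library = [encode(img) for img in baseline_images]
        self.threshold = threshold

    def is_drifted(self, image):
        f = self.encode(image)
        best = max(cosine_sim(f, ref) for ref in self.library)
        return best < self.threshold

# Per the reported sensitivity study, such feature-level comparisons react to
# 1% salt-and-pepper or speckle noise while staying invariant to lighting.
```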
https://arxiv.org/abs/2408.08456
Drawing freehand sketches of mechanical components on multimedia devices for AI-based engineering modeling has become a new trend. However, its development is being impeded because existing works cannot produce suitable sketches for data-driven research. These works either generate sketches lacking a freehand style or utilize generative models not originally designed for this task, resulting in poor effectiveness. To address this issue, we design a two-stage generative framework that mimics the human sketching behavior pattern, called MSFormer, which to our knowledge is the first to produce human-style freehand sketches tailored for mechanical components. The first stage employs Open CASCADE technology to obtain multi-view contour sketches from mechanical components, filtering perturbing signals for the ensuing generation process. Meanwhile, we design a view selector to simulate viewpoint selection tasks during human sketching for picking out information-rich sketches. The second stage translates contour sketches into freehand sketches by a transformer-based generator. To retain essential modeling features as much as possible and rationalize stroke distribution, we introduce a novel edge-constraint stroke initialization. Furthermore, we utilize a CLIP vision encoder and a new loss function incorporating the Hausdorff distance to enhance the generalizability and robustness of the model. Extensive experiments demonstrate that our approach achieves state-of-the-art performance for generating freehand sketches in the mechanical domain. Project page: this https URL.
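An illustrative Hausdorff-distance term between predicted and target stroke point sets; the averaged (rather than max) variant shown is a common differentiable-friendly choice, and how it is weighted against the other losses is an assumption:

```python
# Averaged symmetric Hausdorff term between two stroke point sets.
import torch

def hausdorff_loss(pred, target):
    # pred: (N, 2), target: (M, 2) stroke points in image coordinates.
    d = torch.cdist(pred, target)              # (N, M) pairwise distances
    forward = d.min(dim=1).values.mean()       # pred -> nearest target point
    backward = d.min(dim=0).values.mean()      # target -> nearest pred point
    return forward + backward

pred = torch.rand(128, 2, requires_grad=True)
target = torch.rand(140, 2)
loss = hausdorff_loss(pred, target)
loss.backward()
```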
https://arxiv.org/abs/2408.05966