With the recent surge in the use of touchscreen devices, free-hand sketching has emerged as a promising modality for human-computer interaction. While previous research has focused on tasks such as recognition, retrieval, and generation of familiar everyday objects, this study aims to create a Sketch Input Method Editor (SketchIME) specifically designed for a professional C4I system. Within this system, sketches are utilized as low-fidelity prototypes for recommending standardized symbols in the creation of comprehensive situation maps. This paper also presents a systematic dataset comprising 374 specialized sketch types, and proposes a simultaneous recognition and segmentation architecture with multilevel supervision between recognition and segmentation to improve performance and enhance interpretability. By incorporating few-shot domain adaptation and class-incremental learning, the network's ability to adapt to new users and extend to new task-specific classes is significantly enhanced. Results from experiments conducted on both the proposed dataset and the SPG dataset illustrate the superior performance of the proposed architecture. Our dataset and code are publicly available at this https URL.
https://arxiv.org/abs/2311.18254
The development of text-to-video (T2V), i.e., generating videos from a given text prompt, has advanced significantly in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community has therefore leveraged dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, but collecting such dense signals correspondingly increases the burden at inference time. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at this https URL .
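To make the idea of temporally sparse conditioning concrete, the toy sketch below packs a handful of per-frame condition maps into a dense tensor with a binary mask channel marking which frames are conditioned. This is a hedged illustration of one plausible input format for such a condition encoder, not SparseCtrl's actual implementation; all names and shapes are made up for the example.

```python
import torch

def pack_sparse_conditions(cond_frames, frame_indices, num_frames, channels, height, width):
    """Pack a few per-frame condition maps (e.g. sketches or depth) into a dense
    tensor for a condition encoder.  Unconditioned frames are zero-filled and a
    binary mask channel records which frames actually carry a signal."""
    cond = torch.zeros(num_frames, channels, height, width)
    mask = torch.zeros(num_frames, 1, height, width)
    for frame, idx in zip(cond_frames, frame_indices):
        cond[idx] = frame          # place the sparse condition at its frame index
        mask[idx] = 1.0            # mark this frame as conditioned
    return torch.cat([cond, mask], dim=1)   # (num_frames, channels + 1, H, W)

# Example: condition only the first and last of 16 frames with a 1-channel sketch.
sketch = torch.rand(1, 64, 64)
packed = pack_sparse_conditions([sketch, sketch], [0, 15],
                                num_frames=16, channels=1, height=64, width=64)
print(packed.shape)  # torch.Size([16, 2, 64, 64])
```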
https://arxiv.org/abs/2311.16933
Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of pre-defined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective two-stage approach to tackling this problem. In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. By labeling strokes group by group, our method can fully leverage context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated, demonstrating its superior performance. Additionally, we offer insights into solving part imbalance in training data and a preliminary experiment on cross-category training, which can inspire future research in this field.
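To illustrate the kind of supervision signal the extra dense distance field provides, the toy code below rasterizes polyline strokes and computes, for every pixel, the Euclidean distance to the nearest stroke. This is only an illustrative sketch of the auxiliary target, not the authors' autoencoder; the stroke format and canvas size are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def stroke_distance_field(strokes, size=256):
    """Rasterize polyline strokes onto a binary canvas and return a dense field
    where each pixel holds its Euclidean distance to the nearest stroke pixel."""
    canvas = np.zeros((size, size), dtype=bool)
    for stroke in strokes:                      # stroke: (N, 2) points with x, y in [0, 1]
        pts = np.clip((np.asarray(stroke) * (size - 1)).astype(int), 0, size - 1)
        for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
            n = max(abs(x1 - x0), abs(y1 - y0), 1)
            xs = np.linspace(x0, x1, n + 1).astype(int)
            ys = np.linspace(y0, y1, n + 1).astype(int)
            canvas[ys, xs] = True               # draw the segment pixel by pixel
    return distance_transform_edt(~canvas)      # distance to the nearest stroke pixel

field = stroke_distance_field([[(0.1, 0.1), (0.9, 0.2)], [(0.5, 0.1), (0.5, 0.9)]])
print(field.shape, field.max())
```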
https://arxiv.org/abs/2311.16682
Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: this https URL.
https://arxiv.org/abs/2311.16432
We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction techniques. We achieve tighter convergence rates and relax assumptions, bridging the gap between theoretical principles and practical applications, covering several important techniques such as Dropout and Sparse training. This work presents promising opportunities to enhance the theoretical understanding of model training through a sparsification-aware optimization approach.
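As a minimal illustration of training with random sketch operators, the snippet below applies an unbiased diagonal sparsification sketch to the gradient before each SGD update on a toy quadratic. The sampling distribution and step size are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sparsifier(dim, keep_prob):
    """Draw a diagonal random sketch: keep each coordinate with probability
    keep_prob and rescale kept entries by 1 / keep_prob so the sketch is
    unbiased in expectation (E[S g] = g)."""
    mask = rng.random(dim) < keep_prob
    return mask / keep_prob

def sketched_sgd_step(w, grad_fn, keep_prob=0.2, lr=0.1):
    """One SGD step on a sketched (sparsified) gradient."""
    s = random_sparsifier(w.size, keep_prob)
    return w - lr * s * grad_fn(w)

# Toy quadratic objective f(w) = 0.5 * ||w - 1||^2 with gradient w - 1.
w = np.zeros(10)
for _ in range(500):
    w = sketched_sgd_step(w, lambda v: v - 1.0)
print(np.round(w, 2))   # approaches the minimizer at all-ones
```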
https://arxiv.org/abs/2311.16086
Understanding semantic intricacies and high-level concepts is essential in image sketch generation, and this challenge becomes even more formidable when applied to the domain of videos. To address this, we propose a novel optimization-based framework for sketching videos represented by the frame-wise Bézier curve. In detail, we first propose a cross-frame stroke initialization approach to warm up the location and the width of each curve. Then, we optimize the locations of these curves by utilizing a semantic loss based on CLIP features and a newly designed consistency loss using the self-decomposed 2D atlas network. Built upon these design elements, the resulting sketch video showcases impressive visual abstraction and temporal coherence. Furthermore, by transforming a video into SVG lines through the sketching process, our method unlocks applications in sketch-based video editing and video doodling, enabled through video composition, as exemplified in the teaser.
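For context, a stroke parameterized as a cubic Bézier curve is just four control points per frame, and evaluating the curve gives the point locations that the semantic and consistency losses act on. The snippet below evaluates such a curve; the control-point values are made up for illustration.

```python
import numpy as np

def cubic_bezier(control_points, num_samples=32):
    """Evaluate a cubic Bézier curve B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1
    + 3(1-t) t^2 P2 + t^3 P3 at num_samples values of t in [0, 1]."""
    p0, p1, p2, p3 = np.asarray(control_points, dtype=float)
    t = np.linspace(0.0, 1.0, num_samples)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# One stroke in one frame: four 2D control points (illustrative values).
points = cubic_bezier([(0.0, 0.0), (0.3, 0.8), (0.7, 0.8), (1.0, 0.0)])
print(points.shape)   # (32, 2) sampled stroke locations for rasterization
```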
https://arxiv.org/abs/2311.15306
This document illustrates the use of pyrealb for generating two parallel texts (English and French) from a single source of data. The data selection and text organization processes are shared between the two languages; only language-dependent word and phrasing choices are distinct. The realized texts thus convey identical information in both languages without the risk of anything being lost in translation. This is especially important in cases where strict and simultaneous bilingualism is required. We first present the types of applications targeted by this approach and how the pyrealb English and French realizer can be used to achieve this goal in a natural way. We describe an object-oriented organization that ensures convenient realization in both languages. To illustrate the process, different types of applications are then briefly sketched, with links to the source code. We conclude with a brief comparison of the generated text against the output of a GPT instance.
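The object-oriented organization can be pictured with a plain-Python sketch: data selection and document structure live in a shared base class, while only the phrasing is language-specific. In the actual system the phrasing methods would build pyrealb constituent expressions rather than the f-strings used here; this stand-in is purely illustrative.

```python
class BilingualReport:
    """Shared data selection and document structure; only the word- and
    phrase-level choices are delegated to language-specific subclasses."""

    def __init__(self, data):
        self.data = data                       # single source of data for both languages

    def select_facts(self):
        # Language-independent: decide WHAT to say (same for English and French).
        return [f for f in self.data if f["relevant"]]

    def realize(self):
        # Language-independent ordering, language-dependent phrasing.
        return " ".join(self.phrase(fact) for fact in self.select_facts())

    def phrase(self, fact):
        raise NotImplementedError

class EnglishReport(BilingualReport):
    def phrase(self, fact):
        return f"The temperature reaches {fact['temp']} degrees."

class FrenchReport(BilingualReport):
    def phrase(self, fact):
        return f"La température atteint {fact['temp']} degrés."

data = [{"relevant": True, "temp": 21}]
print(EnglishReport(data).realize())
print(FrenchReport(data).realize())
```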
https://arxiv.org/abs/2311.14808
Deploying Large Language Models (LLMs) in streaming applications that involve long contexts, particularly for extended dialogues and text analysis, is of paramount importance but presents two significant challenges. Firstly, memory consumption is substantial during the decoding phase due to the caching of the Key and Value states (KV) of previous tokens. Secondly, attention computation is time-consuming, with a time complexity of $O(n^2)$ for the generation of each token. At the recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model able to support documents of up to 128K tokens; in this paper, we focus on the memory-efficiency issue when the context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite the computation of the attention ${\sf Attn}(Q, K, V)$ in $n^{1+o(1)}$ time. Despite this, storing the Key and Value matrices $K, V \in \mathbb{R}^{n \times d}$ still necessitates $O(nd)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that reads the data in a single streaming pass. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need to store $K$ and $V$ exactly. Notably, our algorithm exhibits exceptional memory efficiency with super-long token sequences: as the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.
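To convey why sketching keeps memory nearly constant as the stream grows, the toy below streams key/value tokens while maintaining only two small random-feature summaries. This is a Performer-style approximation used purely as an illustration; it is not the paper's three-sketch algorithm and carries none of its error guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64                      # head dimension and sketch (feature) size
W = rng.normal(size=(d, m)) / np.sqrt(d)

def feature_map(x):
    """Positive random features approximating exp(q . k) (Performer-style)."""
    return np.exp(x @ W - np.sum(x * x, axis=-1, keepdims=True) / 2)

# Streaming pass: keep only two small summaries instead of the full K, V caches.
A = np.zeros((m, d))               # accumulates phi(k_i) v_i^T
z = np.zeros(m)                    # accumulates phi(k_i)
for _ in range(10_000):            # arbitrarily long stream of key/value tokens
    k, v = rng.normal(size=d), rng.normal(size=d)
    phi_k = feature_map(k)
    A += np.outer(phi_k, v)
    z += phi_k

q = rng.normal(size=d)
phi_q = feature_map(q)
out = (phi_q @ A) / (phi_q @ z)    # approximate attention output, O(m d) memory
print(out.shape)                   # (16,)  -- memory never grew with stream length
```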
https://arxiv.org/abs/2311.14652
A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.
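The two-component motion model can be pictured as follows: each frame's stroke control points receive a small local displacement and are then mapped through a shared global affine transform. In the method these parameters are optimized by the score-distillation loss; the values below are placeholders for illustration only.

```python
import numpy as np

def animate_points(points, local_delta, affine, translation):
    """Apply the two motion components to an (N, 2) array of stroke control
    points: a small per-point local deformation, then a shared global affine map."""
    deformed = points + local_delta            # local component: per-point offsets
    return deformed @ affine.T + translation   # global component: affine transform

points = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.2]])        # one stroke
local_delta = 0.02 * np.random.default_rng(0).normal(size=points.shape)
theta = np.deg2rad(5.0)                                         # slight rotation
affine = 1.05 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])     # scale + rotate
frame = animate_points(points, local_delta, affine, translation=np.array([0.01, 0.0]))
print(frame)
```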
https://arxiv.org/abs/2311.13608
Engineering Design is undergoing a transformative shift with the advent of AI, marking a new era in how we approach product, system, and service planning. Large language models have demonstrated impressive capabilities in enabling this shift. Yet, with text as their only input modality, they cannot leverage the large body of visual artifacts that engineers have used for centuries and are accustomed to. This gap is addressed with the release of multimodal vision language models, such as GPT-4V, enabling AI to impact many more types of tasks. In light of these advancements, this paper presents a comprehensive evaluation of GPT-4V, a vision language model, across a wide spectrum of engineering design tasks, categorized into four main areas: Conceptual Design, System-Level and Detailed Design, Manufacturing and Inspection, and Engineering Education Tasks. Our study assesses GPT-4V's capabilities in design tasks such as sketch similarity analysis, concept selection using Pugh Charts, material selection, engineering drawing analysis, CAD generation, topology optimization, design for additive and subtractive manufacturing, spatial reasoning challenges, and textbook problems. Through this structured evaluation, we not only explore GPT-4V's proficiency in handling complex design and manufacturing challenges but also identify its limitations in complex engineering design applications. Our research establishes a foundation for future assessments of vision language models, emphasizing their immense potential for innovating and enhancing the engineering design and manufacturing landscape. It also contributes a set of benchmark testing datasets, with more than 1000 queries, for ongoing advancements and applications in this field.
https://arxiv.org/abs/2311.12668
Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring the users to provide per-image control conditions such as sketches. However, these simplifications render these methods incompetent for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images, with minimal human interactions. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module to transform simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves the image quality but also allows easy and intuitive user interactions. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images.
https://arxiv.org/abs/2311.11243
Distribution shifts are characterized by differences between the training and test data distributions. They can significantly reduce the accuracy of machine learning models deployed in real-world scenarios. This paper explores the distribution shift problem when classifying pollen grains from microscopic images collected in the wild with a low-cost camera sensor. We leverage the domain knowledge that geometric features are highly important for accurate pollen identification and introduce two novel geometric image augmentation techniques to significantly narrow the accuracy gap between the model performance on the train and test datasets. In particular, we show that Tenengrad and ImageToSketch filters are highly effective to balance the shape and texture information while leaving out unimportant details that may confuse the model. Extensive evaluations on various model architectures demonstrate a consistent improvement of the model generalization to field data of up to 14% achieved by the geometric augmentation techniques when compared to a wide range of standard image augmentations. The approach is validated through an ablation study using pollen hydration tests to recover the shape of dry pollen grains. The proposed geometric augmentations also receive the highest scores according to the affinity and diversity measures from the literature.
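Tenengrad is a classical Sobel-gradient measure; used as an augmentation, it replaces an image by its gradient-magnitude map, preserving shape cues while suppressing fine texture. A minimal version is sketched below (the exact kernel sizes and normalization used in the paper are not specified here).

```python
import numpy as np
from scipy.ndimage import sobel

def tenengrad_filter(image):
    """Replace a grayscale image by its Sobel gradient magnitude, emphasizing
    pollen-grain contours (shape) over fine surface texture."""
    gx = sobel(image.astype(float), axis=1)      # horizontal gradient
    gy = sobel(image.astype(float), axis=0)      # vertical gradient
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return magnitude / (magnitude.max() + 1e-8)  # normalize to [0, 1]

augmented = tenengrad_filter(np.random.default_rng(0).random((128, 128)))
print(augmented.shape, float(augmented.max()))
```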
https://arxiv.org/abs/2311.11029
Socially-aware navigation systems have evolved to adeptly avoid various obstacles while performing multiple tasks, such as point-to-point navigation, human-following, and human-guiding. However, a prominent gap persists: in Human-Robot Interaction (HRI), communicating commands to robots demands intricate mathematical formulations. Furthermore, the transition between tasks lacks the intuitive control and user-centric interactivity one would desire. In this work, we propose an LLM-driven interactive multimodal multitask robot navigation framework, termed LIM2N, to address this new challenge in the navigation field. We achieve this by first introducing a multimodal interaction framework where language and hand-drawn inputs can serve as navigation constraints and control objectives. Next, a reinforcement learning agent is built to handle multiple tasks with the received information. Crucially, LIM2N creates smooth cooperation among the reasoning over multimodal input, multitask planning, and the adaptation and processing of the intelligent sensing modules in this complicated system. Extensive experiments conducted in both simulation and the real world demonstrate that LIM2N offers superior understanding of user needs alongside an enhanced interactive experience.
https://arxiv.org/abs/2311.08244
Reference-based video object segmentation is an emerging topic which aims to segment the corresponding target object in each video frame referred by a given reference, such as a language expression or a photo mask. However, language expressions can sometimes be vague in conveying an intended concept and ambiguous when similar objects in one frame are hard to distinguish by language. Meanwhile, photo masks are costly to annotate and less practical to provide in a real application. This paper introduces a new task of sketch-based video object segmentation, an associated benchmark, and a strong baseline. Our benchmark includes three datasets, Sketch-DAVIS16, Sketch-DAVIS17 and Sketch-YouTube-VOS, which exploit human-drawn sketches as an informative yet low-cost reference for video object segmentation. We take advantage of STCN, a popular baseline of semi-supervised VOS task, and evaluate what the most effective design for incorporating a sketch reference is. Experimental results show sketch is more effective yet annotation-efficient than other references, such as photo masks, language and scribble.
https://arxiv.org/abs/2311.07261
It has been observed that many classical planning domains with atomic goals can be solved by means of a simple polynomial exploration procedure, called IW, that runs in time exponential in the problem width, which in these cases is bounded and small. Yet, while the notion of width has become part of state-of-the-art planning algorithms such as BFWS, there is no good explanation for why so many benchmark domains have bounded width when atomic goals are considered. In this work, we address this question by relating bounded width with the existence of general optimal policies that in each planning instance are represented by tuples of atoms of bounded size. We also define the notions of (explicit) serializations and serialized width that have a broader scope as many domains have a bounded serialized width but no bounded width. Such problems are solved non-optimally in polynomial time by a suitable variant of the Serialized IW algorithm. Finally, the language of general policies and the semantics of serializations are combined to yield a simple, meaningful, and expressive language for specifying serializations in compact form in the form of sketches, which can be used for encoding domain control knowledge by hand or for learning it from small examples. Sketches express general problem decompositions in terms of subgoals, and sketches of bounded width express problem decompositions that can be solved in polynomial time.
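The IW procedure can be sketched compactly: IW(1) is breadth-first search that prunes every generated state that does not make some atom true for the first time, so the number of expanded states is at most linear in the number of atoms. The toy below abstracts states as sets of atoms; the example domain is made up for illustration.

```python
from collections import deque

def iw1(initial_state, successors, is_goal):
    """IW(1): breadth-first search that keeps a state only if it makes some
    atom true for the first time in the search (novelty-1 pruning)."""
    seen_atoms = set(initial_state)             # atoms already made true by some state
    queue = deque([(initial_state, [])])
    while queue:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        for action, next_state in successors(state):
            new_atoms = next_state - seen_atoms
            if new_atoms:                       # novel: expands the set of reached atoms
                seen_atoms |= new_atoms
                queue.append((next_state, plan + [action]))
    return None

# Toy domain: atoms 'at-0' ... 'at-4'; moving right adds the next location atom.
def successors(state):
    i = max(int(a.split('-')[1]) for a in state if a.startswith('at-'))
    return [("move", frozenset(state | {f"at-{i + 1}"}))] if i < 4 else []

print(iw1(frozenset({"at-0"}), successors, lambda s: "at-4" in s))
```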
https://arxiv.org/abs/2311.05490
Modern biomedical image analysis using deep learning often encounters the challenge of limited annotated data. To overcome this issue, deep generative models can be employed to synthesize realistic biomedical images. In this regard, we propose an image synthesis method that utilizes denoising diffusion probabilistic models (DDPMs) to automatically generate retinal optical coherence tomography (OCT) images. By providing rough layer sketches, the trained DDPMs can generate realistic circumpapillary OCT images. We further find that more accurate pseudo labels can be obtained through knowledge adaptation, which greatly benefits the segmentation task. Through this, we observe a consistent improvement in layer segmentation accuracy, which is validated using various neural networks. Furthermore, we have discovered that a layer segmentation model trained solely with synthesized images can achieve comparable results to a model trained exclusively with real images. These findings demonstrate the promising potential of DDPMs in reducing the need for manual annotations of retinal OCT images.
https://arxiv.org/abs/2311.05479
Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively controlling and shaping the synthetic 3D contents according to users' desired specifications (e.g., a sketch). To alleviate this issue, we present the first attempt at text-to-3D generation conditioned on an additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF, encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the image rendered from the synthetic 3D scene. This estimated sketch, for each sampled view, is further enforced to be geometrically consistent with the given sketch, pursuing better controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal can generate accurate and faithful 3D scenes that align closely with the input text prompts and sketches.
https://arxiv.org/abs/2311.05461
The human brain is naturally equipped to comprehend and interpret visual information rapidly. When confronted with complex problems or concepts, we use flowcharts, sketches, and diagrams to aid our thought process. Leveraging this inherent ability can significantly enhance logical reasoning. However, current Large Language Models (LLMs) do not utilize such visual intuition to help their thinking. Even the most advanced vision-language models (e.g., GPT-4V and LLaVA) merely align images into textual space, which means their reasoning processes remain purely verbal. To mitigate such limitations, we present a Chain of Images (CoI) approach, which can convert complex language reasoning problems into simple pattern recognition by generating a series of images as intermediate representations. Furthermore, we have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving. Based on this dataset, we aim to construct a benchmark to assess the capability of future multimodal large-scale models to leverage images for reasoning. In support of our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions and accepts both text and images as input. Experiments on Geometry, Chess, and Common Sense tasks sourced from the CoI evaluation dataset show that CoI improves performance significantly over the pure-language Chain of Thoughts (CoT) baselines. The code is available at this https URL.
https://arxiv.org/abs/2311.09241
Many recent prompting strategies for large language models (LLMs) query the model multiple times sequentially -- first to produce intermediate results and then the final answer. However, using these methods, both decoder and model are unaware of potential follow-up prompts, leading to disconnected and undesirably wordy intermediate responses. In this work, we address this issue by proposing prompt sketching, a new prompting paradigm in which an LLM does not only respond by completing a prompt, but by predicting values for multiple variables in a template. This way, sketching grants users more control over the generation process, e.g., by providing a reasoning framework via intermediate instructions, leading to better overall results. The key idea enabling sketching with existing, autoregressive models is to adapt the decoding procedure to also score follow-up instructions during text generation, thus optimizing overall template likelihood in inference. Our experiments show that in a zero-shot setting, prompt sketching outperforms existing, sequential prompting schemes such as direct asking or chain-of-thought on 7 out of 8 LLM benchmarking tasks, including state tracking, arithmetic reasoning, and general question answering. To facilitate future use, we release a number of generic, yet effective sketches applicable to many tasks, and an open source library called dclib, powering our sketch-aware decoders.
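The following is a hedged rendering of the prompt-sketching idea, not the dclib API: a template fixes the intermediate instructions, and the model only fills the variable slots, each time conditioning on the template text produced so far. The `generate` callable is a stand-in for any LLM backend, and the decoding machinery is heavily simplified.

```python
# Illustrative template with multiple variable slots to be filled by the model.
TEMPLATE = (
    "Q: {question}\n"
    "First, list the relevant facts: [FACTS]\n"
    "Now reason step by step using only those facts: [REASONING]\n"
    "Therefore, the final answer is [ANSWER]."
)

def fill_sketch(question, generate):
    """Fill the template variables one at a time, each time conditioning on the
    full template text produced so far (so follow-up instructions are visible)."""
    text = TEMPLATE.format(question=question)
    values = {}
    for var in ("FACTS", "REASONING", "ANSWER"):
        prefix = text.split(f"[{var}]")[0]            # everything before this slot
        values[var] = generate(prefix)                # model predicts the slot value
        text = text.replace(f"[{var}]", values[var], 1)
    return text, values

# Dummy generator so the sketch runs without an LLM backend.
completed, slots = fill_sketch("What is 2 + 3?", lambda prefix: "(model output)")
print(completed)
```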
https://arxiv.org/abs/2311.04954
Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data in practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks such as image captioning, which has primarily been carried out on natural images, still struggle to produce accurate and meaningful captions for the sketched images often included in scientific and technical documents. The advancement of other tasks, such as 3D reconstruction from 2D images, requires larger datasets with multiple viewpoints. We introduce DeepPatent2, a large-scale dataset providing more than 2.7 million technical drawings with 132,890 object names and 22,394 viewpoints extracted from 14 years of US design patent documents. We demonstrate the usefulness of DeepPatent2 with conceptual captioning. We further discuss the potential of our dataset to facilitate other research areas such as 3D image reconstruction and image retrieval.
https://arxiv.org/abs/2311.04098