Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text descriptions. In this paper, we introduce Uni-ControlNet, a novel approach that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at \url{this https URL}.
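A minimal sketch (not the authors' implementation) of why the adapter count can stay constant: all local condition maps can be stacked channel-wise into a single local adapter, while all global embeddings are projected by a single global adapter into extra conditioning tokens. The module names, shapes, and projection scheme below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalAdapter(nn.Module):
    """One adapter for all local controls: condition maps are concatenated
    channel-wise, so adding a control type only widens the input."""
    def __init__(self, num_conditions: int, cond_channels: int = 3, feat_dim: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_conditions * cond_channels, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, feat_dim, 3, padding=1),
        )

    def forward(self, cond_maps):  # list of (B, 3, H, W) condition maps
        return self.net(torch.cat(cond_maps, dim=1))  # (B, feat_dim, H, W)

class GlobalAdapter(nn.Module):
    """One adapter for all global controls: each CLIP image embedding is
    projected to a few tokens appended to the text conditioning."""
    def __init__(self, clip_dim: int = 768, ctx_dim: int = 768, tokens_per_cond: int = 4):
        super().__init__()
        self.proj = nn.Linear(clip_dim, ctx_dim * tokens_per_cond)
        self.tokens_per_cond = tokens_per_cond

    def forward(self, clip_embeds):  # (B, N_cond, clip_dim)
        B, N, _ = clip_embeds.shape
        return self.proj(clip_embeds).reshape(B, N * self.tokens_per_cond, -1)

# Usage: two adapters total, regardless of how many controls are plugged in.
local = LocalAdapter(num_conditions=3)         # e.g. edges + depth + segmentation
glob = GlobalAdapter()
edge, depth, seg = (torch.rand(1, 3, 64, 64) for _ in range(3))
local_feat = local([edge, depth, seg])         # injected into the frozen UNet
extra_ctx = glob(torch.rand(1, 1, 768))        # concatenated with text tokens
```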
https://arxiv.org/abs/2305.16322
Decomposing an object's appearance into representations of its materials and the surrounding illumination is difficult, even when the object's 3D shape is known beforehand. This problem is ill-conditioned because diffuse materials severely blur incoming light, and is ill-posed because diffuse materials under high-frequency lighting can be indistinguishable from shiny materials under low-frequency lighting. We show that it is possible to recover precise materials and illumination -- even from diffuse objects -- by exploiting unintended shadows, like the ones cast onto an object by the photographer who moves around it. These shadows are a nuisance in most previous inverse rendering pipelines, but here we exploit them as signals that improve conditioning and help resolve material-lighting ambiguities. We present a method based on differentiable Monte Carlo ray tracing that uses images of an object to jointly recover its spatially-varying materials, the surrounding illumination environment, and the shapes of the unseen light occluders that inadvertently cast shadows upon it.
https://arxiv.org/abs/2305.16321
This paper reveals that every image can be understood as a first-order norm+linear autoregressive process, referred to as FINOLA, where norm+linear denotes the use of normalization before the linear model. We demonstrate that images of size 256$\times$256 can be reconstructed from a compressed vector using autoregression up to a 16$\times$16 feature map, followed by upsampling and convolution. This discovery sheds light on the underlying partial differential equations (PDEs) governing the latent feature space. Additionally, we investigate the application of FINOLA for self-supervised learning through a simple masked prediction technique. By encoding a single unmasked quadrant block, we can autoregressively predict the surrounding masked region. Remarkably, this pre-trained representation proves effective for image classification and object detection tasks, even in lightweight networks, without requiring fine-tuning. The code will be made publicly available.
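A toy numpy sketch of the "norm+linear" first-order recurrence the abstract describes: normalize a feature vector, apply a learned linear map, and step to the next spatial position. The specific left/top neighbor scheme, shapes, and random weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def norm(x, eps=1e-6):
    # channel-wise normalization of a single feature vector
    return (x - x.mean()) / (x.std() + eps)

def finola_rollout(q, A, B, H=16, W=16):
    """Autoregressively expand one compressed vector q of shape (C,) into an
    HxW feature map. Each new position is predicted from its left/top neighbor
    with a 'norm + linear' first-order step (illustrative scheme)."""
    C = q.shape[0]
    feat = np.zeros((H, W, C))
    feat[0, 0] = q
    for i in range(H):
        for j in range(W):
            if i == 0 and j == 0:
                continue
            if j > 0:   # horizontal step from the left neighbor
                feat[i, j] = feat[i, j - 1] + A @ norm(feat[i, j - 1])
            else:       # vertical step from the top neighbor
                feat[i, j] = feat[i - 1, j] + B @ norm(feat[i - 1, j])
    return feat  # would then be upsampled and convolved into a 256x256 image

C = 8
rng = np.random.default_rng(0)
fmap = finola_rollout(rng.normal(size=C), 0.1 * rng.normal(size=(C, C)),
                      0.1 * rng.normal(size=(C, C)))
print(fmap.shape)  # (16, 16, 8)
```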
https://arxiv.org/abs/2305.16319
Recently, video object segmentation (VOS) referred to by multi-modal signals, e.g., language and audio, has attracted increasing attention in both industry and academia. It is challenging to explore the semantic alignment across modalities and the visual correspondence across frames. However, existing methods adopt separate network architectures for different modalities, and neglect the inter-frame temporal interaction with references. In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. For the first time within a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals. Firstly, for low-level temporal aggregation before the transformer, we enable the multi-modal references to capture multi-scale visual cues from consecutive video frames. This effectively endows the text or audio signals with temporal knowledge and boosts the semantic alignment between modalities. Secondly, for high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video. On the Ref-YouTube-VOS and AVSBench datasets with respective text and audio references, MUTR achieves +4.2% and +4.2% J&F improvements over state-of-the-art methods, demonstrating the significance of our unified framework for multi-modal VOS. Code is released at this https URL.
https://arxiv.org/abs/2305.16318
Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.
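A minimal numpy illustration of the Picard-iteration idea behind parallel sampling: guess the whole denoising trajectory, then refine every step in parallel until the trajectory stops changing. The toy "drift" function and step rule below are stand-ins for the pretrained model, not the paper's sampler.

```python
import numpy as np

def drift(x, t):
    # stand-in for the model's denoising drift at time t
    return -x / (t + 1.0)

def picard_sampling(x0, num_steps=100, dt=0.1, tol=1e-6, max_iters=200):
    """Refine a guess of the entire trajectory in parallel (Picard iteration):
    x[i+1] = x[0] + sum_{j<=i} drift(x[j], t_j) * dt.
    Each sweep updates every step from the previous sweep's values, so all
    drift evaluations within a sweep can run as one batched model call."""
    ts = np.arange(num_steps) * dt
    traj = np.tile(x0, (num_steps + 1, 1))          # initial guess: constant
    for _ in range(max_iters):
        drifts = drift(traj[:-1], ts[:, None])       # evaluate all steps at once
        new_traj = traj.copy()
        new_traj[1:] = x0 + np.cumsum(drifts * dt, axis=0)
        if np.abs(new_traj - traj).max() < tol:      # converged trajectory
            return new_traj
        traj = new_traj
    return traj

sample = picard_sampling(np.ones(4))
print(sample[-1])   # after enough sweeps, matches sequential integration
```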
https://arxiv.org/abs/2305.16317
For computer vision tasks, Vision Transformers (ViTs) have become one of the go-to deep net architectures. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant ViTs on four well-established models, namely, Swin, SwinV2, MViTv2, and CvT, both in theory and practice. Empirically, we tested these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
https://arxiv.org/abs/2305.16316
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D objects, compositions, or scenes, there remains a lack of focus on capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure whose distribution will affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality, and experiments demonstrate our high performance in articulated object generation. We also demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
https://arxiv.org/abs/2305.16315
Equivariance has gained strong interest as a desirable network property that inherently ensures robust generalization. However, when dealing with complex systems such as articulated objects or multi-object scenes, effectively capturing inter-part transformations poses a challenge, as it becomes entangled with the overall structure and local transformations. The interdependence of part assignment and per-part group action necessitates a novel equivariance formulation that allows for their co-evolution. In this paper, we present Banana, a Banach fixed-point network for equivariant segmentation with inter-part equivariance by construction. Our key insight is to iteratively solve a fixed-point problem, where point-part assignment labels and per-part SE(3)-equivariance co-evolve simultaneously. We provide theoretical derivations of both per-step equivariance and global convergence, which induces an equivariant final convergent state. Our formulation naturally provides a strict definition of inter-part equivariance that generalizes to unseen inter-part configurations. Through experiments conducted on both articulated objects and multi-object scans, we demonstrate the efficacy of our approach in achieving strong generalization under inter-part transformations, even when confronted with substantial changes in pointcloud geometry and topology.
https://arxiv.org/abs/2305.16314
Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method. Project page is available at: this https URL
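A hedged sketch of the masked diffusion loss idea: the denoising objective is computed only inside the mask of the concept a handle should capture, with an extra penalty that keeps each handle's cross-attention inside its own mask. Tensor names, shapes, and the penalty form are assumptions for illustration.

```python
import torch

def masked_diffusion_loss(eps_pred, eps, mask):
    """Standard noise-prediction MSE, restricted to one concept's mask."""
    m = mask.float()
    return ((eps_pred - eps) ** 2 * m).sum() / m.sum().clamp(min=1.0)

def attention_localization_loss(attn_map, mask):
    """Penalize a handle's cross-attention mass that falls outside its mask,
    discouraging entanglement between concepts (illustrative form)."""
    attn = attn_map / attn_map.sum(dim=(-2, -1), keepdim=True).clamp(min=1e-8)
    outside = attn * (1.0 - mask.float())
    return outside.sum(dim=(-2, -1)).mean()

# toy shapes: batch of 2, 64x64 latent, one handle's attention map per item
eps_pred, eps = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.5)
attn = torch.rand(2, 64, 64)
loss = masked_diffusion_loss(eps_pred, eps, mask) \
       + 0.01 * attention_localization_loss(attn, mask[:, 0])
```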
https://arxiv.org/abs/2305.16311
We propose a learning-based method to recover normals, specularity, and roughness from a single diffuse image of a material, using microgeometry appearance as our primary cue. Previous methods that work on single images tend to produce over-smooth outputs with artifacts, operate at limited resolution, or train one model per class with little room for generalization. In contrast, in this work, we propose a novel capture approach that leverages a generative network with attention and a U-Net discriminator, which shows outstanding performance integrating global information at reduced computational complexity. We showcase the performance of our method with a real dataset of digitized textile materials and show that a commodity flatbed scanner can produce the type of diffuse illumination required as input to our method. Additionally, because the problem might be ill-posed (more than a single diffuse image might be needed to disambiguate the specular reflection), or because the training dataset is not representative enough of the real distribution, we propose a novel framework to quantify the model's confidence about its prediction at test time. Our method is the first one to deal with the problem of modeling uncertainty in material digitization, increasing the trustworthiness of the process and enabling more intelligent strategies for dataset creation, as we demonstrate with an active learning experiment.
https://arxiv.org/abs/2305.16312
Recent advances in deep generative models have led to the development of methods capable of synthesizing high-quality, realistic images. These models pose threats to society due to their potential misuse. Prior research attempted to mitigate these threats by detecting generated images, but the varying traces left by different generative models make it challenging to create a universal detector capable of generalizing to new, unseen generative models. In this paper, we propose to inject a universal adversarial signature into an arbitrary pre-trained generative model, in order to make its generated contents more detectable and traceable. First, the imperceptible optimal signature for each image can be found by a signature injector through adversarial training. Subsequently, the signature can be incorporated into an arbitrary generator by fine-tuning it with the images processed by the signature injector. In this way, the detector corresponding to the signature can be reused for any fine-tuned generator for tracking the generator identity. The proposed method is validated on the FFHQ and ImageNet datasets with various state-of-the-art generative models, consistently showing a promising detection rate. Code will be made publicly available at \url{this https URL}.
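A rough sketch (with assumed module names and losses) of the first stage: a signature injector is trained adversarially against a detector so that the added signature stays imperceptible yet detectable. The second stage, fine-tuning an arbitrary generator on the signed images so the same detector transfers to it, is only noted in a comment.

```python
import torch
import torch.nn.functional as F

def train_injector_step(injector, detector, real_images, opt_inj, opt_det,
                        eps=4 / 255, lam=1.0):
    """One adversarial step: the injector adds a bounded signature; the
    detector separates signed from clean images (illustrative losses)."""
    signature = eps * torch.tanh(injector(real_images))   # bounded perturbation
    signed = (real_images + signature).clamp(0, 1)

    # detector: clean -> 0, signed -> 1
    logits = torch.cat([detector(real_images), detector(signed.detach())])
    labels = torch.cat([torch.zeros(len(real_images)), torch.ones(len(signed))])
    det_loss = F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels)
    opt_det.zero_grad(); det_loss.backward(); opt_det.step()

    # injector: keep the signature detectable while staying small
    inj_loss = F.binary_cross_entropy_with_logits(
        detector(signed).squeeze(-1), torch.ones(len(signed))
    ) + lam * signature.pow(2).mean()
    opt_inj.zero_grad(); inj_loss.backward(); opt_inj.step()
    return det_loss.item(), inj_loss.item()

# Stage 2 (not shown): fine-tune any generator toward the signed targets,
# after which the same detector flags and traces that generator's outputs.
```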
https://arxiv.org/abs/2305.16310
Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly, as they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations. In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation. To that end, we present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is specifically curated for imitation learning and can be used to train performant transformer-based policies. In this paper, we present a thorough study of the design decisions required to imitate TAMP and demonstrate that OPTIMUS can solve a wide variety of challenging vision-based manipulation tasks with over 70 different objects, ranging from long-horizon pick-and-place tasks, to shelf and articulated object manipulation, achieving 70 to 80% success rates. Video results at this https URL
https://arxiv.org/abs/2305.16309
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
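An illustrative numpy sketch of the two-stage design: stage one prunes the corpus with precomputed embeddings and fast cosine similarities; stage two re-scores only the surviving reference-text-candidate triplets with a joint scorer. The `triplet_scorer` callable stands in for the dual-encoder re-ranker and is a placeholder assumption.

```python
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def two_stage_retrieval(query_embed, corpus_embeds, triplet_scorer,
                        reference, text, candidates, k=50):
    """Stage 1: fast pruning with precomputed candidate embeddings.
    Stage 2: expensive joint scoring of reference-text-candidate triplets,
    run only on the top-k survivors."""
    sims = cosine_sim(query_embed[None, :], corpus_embeds)[0]   # (N,)
    shortlist = np.argsort(-sims)[:k]
    rescored = [(idx, triplet_scorer(reference, text, candidates[idx]))
                for idx in shortlist]
    rescored.sort(key=lambda pair: -pair[1])
    return [idx for idx, _ in rescored]

# dummy scorer just to show the call pattern
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 256))
ranked = two_stage_retrieval(rng.normal(size=256), corpus,
                             lambda r, t, c: float(c.sum()),
                             reference=None, text="wearing a red hat",
                             candidates=corpus)
print(ranked[:5])
```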
https://arxiv.org/abs/2305.16304
Multi-Agent Path Finding (MAPF) is a fundamental motion coordination problem arising in multi-agent systems with a wide range of applications. The problem's intractability has led to extensive research on improving the scalability of solvers for it. Since optimal solvers can struggle to scale, a major challenge that arises is understanding what makes MAPF hard. We tackle this challenge through a fine-grained complexity analysis of time-optimal MAPF on 2D grids, thereby closing two gaps and identifying a new tractability frontier. First, we show that 2-colored MAPF, i.e., where the agents are divided into two teams, each with its own set of targets, remains NP-hard. Second, for the flowtime objective (also called sum-of-costs), we show that it remains NP-hard to find a solution in which agents have an individually optimal cost, which we call an individually optimal solution. The previously tightest results for these MAPF variants are for (non-grid) planar graphs. We use a single hardness construction that replaces, strengthens, and unifies previous proofs. We believe that it is also simpler than previous proofs for the planar case as it employs minimal gadgets that enable its full visualization in one figure. Finally, for the flowtime objective, we establish a tractability frontier based on the number of directions agents can move in. Namely, we complement our hardness result, which holds for three directions, with an efficient algorithm for finding an individually optimal solution if only two directions are allowed. This result sheds new light on the structure of optimal solutions, which may help guide algorithm design for the general problem.
https://arxiv.org/abs/2305.16303
While impressive performance has been achieved on the task of Answer Sentence Selection (AS2) for English, the same does not hold for languages that lack large labeled datasets. In this work, we propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages without the need for labeled data in the target language. To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages. We conduct extensive experiments on Xtr-WikiQA and TyDi-AS2 with multiple teachers, diverse monolingual and multilingual pretrained language models (PLMs) as students, and both monolingual and multilingual training. The results demonstrate that CLKD either outperforms or rivals even supervised fine-tuning with the same amount of labeled data, as well as a combination of machine translation and the teacher model. Our method can potentially enable stronger AS2 models for low-resource languages, while TyDi-AS2 can serve as the largest multilingual AS2 dataset for further studies in the research community.
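A hedged sketch of one cross-lingual distillation step for AS2: the English teacher scores question-candidate pairs and the multilingual student is trained to match those soft relevance scores on the target-language versions of the same pairs. The model interfaces, loss form, and data pairing are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def clkd_step(teacher, student, en_batch, tgt_batch, optimizer, T=2.0):
    """One distillation step for answer-sentence selection (AS2).
    `en_batch` / `tgt_batch` hold parallel (question, candidate) pairs in
    English and in the target language; both models output a relevance logit."""
    with torch.no_grad():
        teacher_logits = teacher(**en_batch)          # (B,) soft supervision
    student_logits = student(**tgt_batch)             # (B,)

    # match the teacher's soft relevance scores (temperature-scaled binary KD)
    loss = F.binary_cross_entropy_with_logits(
        student_logits / T, torch.sigmoid(teacher_logits / T)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```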
https://arxiv.org/abs/2305.16302
The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.
https://arxiv.org/abs/2305.16301
While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
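A simplified torch sketch of block retrieval through landmark tokens: one landmark key summarizes each block of the context, the query's scores over landmarks pick the top-k blocks, and attention is then run only over the retrieved tokens. Shapes, the single-head setup, and the scoring rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def landmark_retrieval_attention(q, keys, values, landmark_keys, block_size, top_k=2):
    """q: (d,) current query; keys/values: (num_blocks*block_size, d);
    landmark_keys: (num_blocks, d), one trained landmark key per block."""
    d = q.shape[-1]
    # 1) score blocks via their landmark tokens
    block_scores = landmark_keys @ q / d ** 0.5                  # (num_blocks,)
    chosen = torch.topk(block_scores, k=top_k).indices

    # 2) gather the tokens of the retrieved blocks only
    token_ids = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                           for b in chosen.tolist()])
    k_sel, v_sel = keys[token_ids], values[token_ids]

    # 3) ordinary attention restricted to the retrieved tokens
    attn = F.softmax(k_sel @ q / d ** 0.5, dim=-1)               # (top_k*block_size,)
    return attn @ v_sel                                          # (d,)

torch.manual_seed(0)
d, block, nblocks = 16, 8, 6
out = landmark_retrieval_attention(torch.randn(d), torch.randn(nblocks * block, d),
                                   torch.randn(nblocks * block, d),
                                   torch.randn(nblocks, d), block_size=block)
print(out.shape)  # torch.Size([16])
```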
https://arxiv.org/abs/2305.16300
A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to state of the arts, and conduct rigorous analyses to demonstrate the importance of each part of our design.
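A numpy sketch of the two-level weighting the abstract describes: first aggregate tokens within each encoded view, then weight the pooled views themselves before feeding the decoder. The concrete pooling and scoring functions are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_aggregate(views, token_scores, view_scores):
    """views: list of (n_tokens_i, d) encodings of the same image
    (grid features, detected objects, tags, region descriptions, ...).
    Token level: weighted pooling inside each view.
    View level: weighted sum across the pooled views."""
    pooled = []
    for feats, scores in zip(views, token_scores):
        w = softmax(scores)                       # (n_tokens_i,)
        pooled.append(w @ feats)                  # (d,)
    pooled = np.stack(pooled)                     # (num_views, d)
    v = softmax(view_scores)                      # (num_views,)
    return v @ pooled                             # (d,) context for the decoder

rng = np.random.default_rng(0)
views = [rng.normal(size=(49, 32)), rng.normal(size=(10, 32)), rng.normal(size=(5, 32))]
ctx = hierarchical_aggregate(views,
                             [rng.normal(size=len(v)) for v in views],
                             rng.normal(size=len(views)))
print(ctx.shape)  # (32,)
```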
https://arxiv.org/abs/2305.16295
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at this https URL.
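A schematic Python loop (placeholder functions, not the released code) for the iterative prompting mechanism: propose a skill as executable code, run it, feed back environment state, execution errors, and a self-verification verdict, and only then add the skill to the library.

```python
def iterative_prompting(llm, env, skill_library, task, max_rounds=4):
    """Propose -> execute -> critique loop. `llm`, `env`, and the helper
    methods are placeholders for GPT-4 black-box queries and a Minecraft
    wrapper; they are not the actual Voyager APIs."""
    feedback, errors = "", ""
    for _ in range(max_rounds):
        code = llm.generate_skill(task=task, library=skill_library,
                                  env_feedback=feedback, exec_errors=errors)
        try:
            feedback = env.execute(code)           # environment feedback
            errors = ""
        except Exception as exc:                   # execution errors
            errors = str(exc)
            continue
        if llm.self_verify(task=task, env_feedback=feedback):
            skill_library[task] = code             # store for later retrieval
            return code
    return None
```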
https://arxiv.org/abs/2305.16291
Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. On fine-grained and cluttered datasets for classification and detection, ALIA surpasses traditional data augmentation and text-to-image generated data by up to 15\%, often even outperforming equivalent additions of real data. Code is available at this https URL.
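A hedged sketch of the filtering step described above: a classifier trained on the original data rejects edits that corrupt class-relevant information, and near-identical edits are also dropped. The thresholds, the pixel-difference test, and `clf_predict_proba` are illustrative assumptions.

```python
import numpy as np

def filter_augmentations(edited_images, originals, labels, clf_predict_proba,
                         conf_thresh=0.5, min_change=0.02):
    """Keep an edited image (arrays in [0, 1]) only if (1) the classifier
    trained on the original dataset still assigns the original label with
    enough confidence, and (2) the edit is not a near-copy of the source."""
    kept = []
    for edited, orig, y in zip(edited_images, originals, labels):
        probs = clf_predict_proba(edited)                    # (num_classes,)
        class_preserved = probs[y] >= conf_thresh
        changed_enough = np.abs(edited - orig).mean() >= min_change
        if class_preserved and changed_enough:
            kept.append(edited)
    return kept
```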
https://arxiv.org/abs/2305.16289