Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.
https://arxiv.org/abs/2305.16317
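The Picard-iteration idea in the ParaDiGMS abstract above can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a plain Euler discretization of a scalar ODE, and the names `sequential_euler`, `picard_euler`, and `drift` are hypothetical. Each Picard sweep evaluates the drift at every step of the current guessed trajectory (these evaluations are independent, so they could run in parallel) and then rebuilds the trajectory with a cheap prefix sum.

```python
def sequential_euler(f, x0, h, steps):
    """Baseline: each step needs the previous one, so it is inherently serial."""
    xs = [x0]
    for t in range(steps):
        xs.append(xs[-1] + h * f(xs[-1], t))
    return xs


def picard_euler(f, x0, h, steps, tol=1e-12):
    """Picard iteration: guess the whole trajectory, then refine it until it stops changing."""
    xs = [x0] * (steps + 1)  # initial guess: a constant trajectory
    for k in range(1, steps + 1):
        # Drift evaluations depend only on the previous guess, so they are
        # independent of each other and could be batched on parallel hardware.
        drifts = [f(xs[t], t) for t in range(steps)]
        # Cheap prefix sum that turns the drifts into a new trajectory.
        new = [x0]
        for d in drifts:
            new.append(new[-1] + h * d)
        delta = max(abs(a - b) for a, b in zip(new, xs))
        xs = new
        if delta < tol:
            return xs, k
    return xs, steps


def drift(x, t):
    # Hypothetical toy drift standing in for a denoising network.
    return -x


reference = sequential_euler(drift, 1.0, 0.1, 10)
parallel, sweeps = picard_euler(drift, 1.0, 0.1, 10)
```

After k sweeps the first k steps already match the sequential solution, so the iteration needs at most as many sweeps as there are steps; the speedup in practice comes from converging in far fewer sweeps while each sweep runs its drift evaluations concurrently.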
We propose Neural 3D Articulation Prior (NAP), the first 3D deep generative model to synthesize 3D articulated object models. Despite the extensive research on generating 3D objects, compositions, or scenes, little attention has been paid to capturing the distribution of articulated objects, a common object category for human and robot interaction. To generate articulated objects, we first design a novel articulation tree/graph parameterization and then apply a diffusion-denoising probabilistic model over this representation, where articulated objects can be generated via denoising from random complete graphs. In order to capture both the geometry and the motion structure, whose distributions affect each other, we design a graph-attention denoising network for learning the reverse diffusion process. We propose a novel distance that adapts widely used 3D generation metrics to our novel task to evaluate generation quality, and experiments demonstrate our high performance in articulated object generation. We also demonstrate several conditioned generation applications, including Part2Motion, PartNet-Imagination, Motion2Part, and GAPart2Object.
https://arxiv.org/abs/2305.16315
Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data. On the other hand, the use of weakly-supervised data is less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It adopts a similar training procedure to the widely-used masked speech prediction based SSL framework, while incorporating additional target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, allowing potential applications to tasks such as target speech recognition. Our experiments on Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance compared to WavLM, the state-of-the-art SSL model with denoising capability.
https://arxiv.org/abs/2305.16286
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have attracted significant attention. By composing a Markovian process that starts in the data domain and then gradually adds noise until reaching pure white noise, they achieve superior performance in learning data distributions. Yet, these models require a large number of diffusion steps to produce aesthetically pleasing samples, which is inefficient. In addition, unlike common generative adversarial networks, the latent space of diffusion models is not interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we reduce the latent variable dimension in addition to the traditional noise level addition. As a result, we are able to sample images of size $256\times 256$ with only 7 diffusion steps, roughly two orders of magnitude fewer than standard DDPMs require. We formally develop the Markovian diffusion processes of the UDPM, and demonstrate its generation capabilities on the popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable property of UDPM is that it is very easy to interpolate its latent space, which is not the case with standard diffusion models. Our code is available online.
https://arxiv.org/abs/2305.16269
Reconstruction-based methods have struggled to achieve competitive performance on anomaly detection. In this paper, we introduce Denoising Diffusion Anomaly Detection (DDAD). We propose a novel denoising process for image reconstruction conditioned on a target image. This results in a coherent restoration that closely resembles the target image. Subsequently, our anomaly detection framework leverages this conditioning where the target image is set as the input image to guide the denoising process, leading to defectless reconstruction while maintaining nominal patterns. We localise anomalies via a pixel-wise and feature-wise comparison of the input and reconstructed image. Finally, to enhance the effectiveness of feature comparison, we introduce a domain adaptation method that utilises generated examples from our conditioned denoising process to fine-tune the feature extractor. The veracity of the approach is demonstrated on various datasets including MVTec and VisA benchmarks, achieving state-of-the-art results of 99.5% and 99.3% image-level AUROC respectively.
https://arxiv.org/abs/2305.15956
Low-dose computed tomography (CT) image denoising is crucial in medical image computing. Recent years have seen remarkable improvement in deep learning-based methods for this task. However, training deep denoising neural networks requires low-dose and normal-dose CT image pairs, which are difficult to obtain in clinical settings. To address this challenge, we propose a novel fully unsupervised method for low-dose CT image denoising, which is based on a denoising diffusion probabilistic model, a powerful generative model. First, we train an unconditional denoising diffusion probabilistic model capable of generating high-quality normal-dose CT images from random noise. Subsequently, the probabilistic priors of the pre-trained diffusion model are incorporated into a Maximum A Posteriori (MAP) estimation framework for iteratively solving the image denoising problem. Our method ensures the diffusion model produces high-quality normal-dose CT images while keeping the image content consistent with the input low-dose CT images. We evaluate our method on a widely used low-dose CT image denoising benchmark, and it outperforms several supervised low-dose CT image denoising methods in terms of both quantitative and visual performance.
https://arxiv.org/abs/2305.15887
Diffusion-based generative models have exhibited powerful generative performance in recent years. However, because many attributes exist in the data distribution, and because sharing the model parameters across all levels of the generation process has several limitations, it remains challenging to control specific styles for each attribute. To address this problem, this paper presents decoupled denoising diffusion models (DDDMs) with disentangled representations, which can control the style for each attribute in generative models. We apply DDDMs to voice conversion (VC) tasks to address the challenges of disentangling and controlling each speech attribute (e.g., linguistic information, intonation, and timbre). First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for denoising with respect to each attribute. Moreover, we also propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance regardless of the model size. Audio samples are available online.
https://arxiv.org/abs/2305.15816
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available online.
https://arxiv.org/abs/2305.15719
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD). To reduce the gap and improve the performance, current methods often resort to complicated training schemes, loss functions, and feature alignments, which are task-specific and feature-specific. In this paper, we state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature, and propose a novel KD method dubbed DiffKD to explicitly denoise and match features using diffusion models. Our approach is based on the observation that student features typically contain more noise than teacher features due to the smaller capacity of the student model. To address this, we propose to denoise student features using a diffusion model trained on teacher features. This allows us to perform better distillation between the refined clean feature and the teacher feature. Additionally, we introduce a light-weight diffusion model with a linear autoencoder to reduce the computation cost and an adaptive noise matching module to improve the denoising performance. Extensive experiments demonstrate that DiffKD is effective across various types of features and achieves state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code will be made available online.
https://arxiv.org/abs/2305.15712
There are more than 80,000 character categories in Chinese, while most of them are rarely used. To build a high-performance handwritten Chinese character recognition (HCCR) system supporting the full character set with a traditional approach, many training samples need to be collected for each character category, which is both time-consuming and expensive. In this paper, we propose a novel approach to transforming Chinese character glyph images generated from font libraries into handwritten ones with a denoising diffusion probabilistic model (DDPM). Trained on handwritten samples of a small character set, the DDPM is capable of mapping printed strokes to handwritten ones, which makes it possible to generate photo-realistic and stylistically diverse handwritten samples of unseen character categories. Combining DDPM-synthesized samples of unseen categories with real samples of other categories, we can build an HCCR system to support the full character set. Experimental results on the CASIA-HWDB dataset with 3,755 character categories show that the HCCR systems trained with synthetic samples perform similarly to the one trained with real samples in terms of recognition accuracy. The proposed method has the potential to address HCCR with a larger vocabulary.
https://arxiv.org/abs/2305.15660
Denoising Diffusion Probabilistic Models (DDPM) have shown remarkable efficacy in the synthesis of high-quality images. However, their inference process characteristically requires numerous, potentially hundreds, of iterative steps, which could lead to the problem of exposure bias due to the accumulation of prediction errors over iterations. Previous work has attempted to mitigate this issue by perturbing inputs during training, which consequently mandates the retraining of the DDPM. In this work, we conduct a systematic study of exposure bias in diffusion models and, intriguingly, we find that the exposure bias could be alleviated with a new sampling method, without retraining the model. We empirically and theoretically show that, during inference, for each backward time step $t$ and corresponding state $\hat{x}_t$, there might exist another time step $t_s$ which exhibits superior coupling with $\hat{x}_t$. Based on this finding, we introduce an inference method named Time-Shift Sampler. Our framework can be seamlessly integrated with existing sampling algorithms, such as DDIM or DDPM, inducing merely minimal additional computations. Experimental results show that our proposed framework can effectively enhance the quality of images generated by existing sampling algorithms.
https://arxiv.org/abs/2305.15583
We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered, discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training a unit-based encoder-decoder language model with a denoising objective, (b) training it with word-by-word translated utterance pairs created by aligning monolingual text embedding spaces, and (c) running unsupervised backtranslation bootstrapping off of the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text-to-speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU in German to English and 8.07 in English to German.
https://arxiv.org/abs/2305.15405
Synthesizing novel 3D models that resemble the input example has long been pursued by researchers and artists in computer graphics. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our model can generate 3D shapes of various types with better quality than prior methods.
https://arxiv.org/abs/2305.15399
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.
https://arxiv.org/abs/2305.15391
Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information. In this work, we propose search-augmented instruction learning (SAIL), which grounds the language generation and instruction following abilities on complex search results generated by in-house and external search engines. With an instruction tuning corpus, we collect search results for each training case from different search APIs and domains, and construct a new search-grounded training set containing \textit{(instruction, grounding information, response)} triplets. We then fine-tune the LLaMA-7B model on the constructed training set. Since the collected results contain unrelated and disputing languages, the model needs to learn to ground on trustworthy search results, filter out distracting passages, and generate the target response. The search result-denoising process entails explicit trustworthy information selection and multi-hop reasoning, since the retrieved passages might be informative but not contain the instruction-following answer. Experiments show that the fine-tuned SAIL-7B model has a strong instruction-following ability, and it performs significantly better on transparency-sensitive tasks, including open-ended question answering and fact checking.
https://arxiv.org/abs/2305.15225
Cross-lingual named entity recognition (NER) aims to train an NER system that generalizes well to a target language by leveraging labeled data in a given source language. Previous work alleviates the data scarcity problem by translating source-language labeled data or performing knowledge distillation on target-language unlabeled data. However, these methods may suffer from label noise due to the automatic labeling process. In this paper, we propose CoLaDa, a Collaborative Label Denoising Framework, to address this problem. Specifically, we first explore a model-collaboration-based denoising scheme that enables models trained on different data sources to collaboratively denoise pseudo labels used by each other. We then present an instance-collaboration-based strategy that considers the label consistency of each token's neighborhood in the representation space for denoising. Experiments on different benchmark datasets show that the proposed CoLaDa achieves superior results compared to previous methods, especially when generalizing to distant languages.
https://arxiv.org/abs/2305.14913
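The instance-collaboration idea in the CoLaDa abstract above, scoring a pseudo label by how well it agrees with the labels of nearby tokens in representation space, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes Euclidean distance, a plain k-nearest-neighbor vote, and toy 2-D embeddings, and the function name `neighborhood_consistency` is hypothetical.

```python
import math


def neighborhood_consistency(embeddings, labels, k=2):
    """For each token, return the fraction of its k nearest neighbors that share its label."""
    scores = []
    for i, (emb, label) in enumerate(zip(embeddings, labels)):
        # Distances to every other token, sorted from nearest to farthest.
        dists = sorted(
            (math.dist(emb, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        agree = sum(1 for j in neighbors if labels[j] == label)
        scores.append(agree / len(neighbors))
    return scores


# Toy data: two clusters; the token at index 3 sits in the "PER" cluster
# but carries a (presumably noisy) "LOC" pseudo label.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]
labels = ["PER", "PER", "PER", "LOC", "LOC", "LOC", "LOC"]
scores = neighborhood_consistency(embeddings, labels, k=2)
```

A low score flags a pseudo label that disagrees with its neighborhood; one plausible use is to down-weight such tokens in the training loss, though how the scores are consumed is a design choice outside this sketch.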
Language models are often at risk of diverse backdoor attacks, especially data poisoning. Thus, it is important to investigate defense solutions for addressing them. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers, leaving a universal defense against various backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend against various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the backdoor shortcuts. To address the label flip caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on the SST-2 dataset show that DPoE significantly improves the defense performance against various types of backdoor triggers, including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of triggers.
https://arxiv.org/abs/2305.14910
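The product-of-experts combination that the DPoE abstract above builds on can be sketched in a few lines. This is a generic illustration of a two-expert product, not the DPoE training code: the function names and the example logits are hypothetical, and the denoising component described in the abstract is not modeled here. A product of experts multiplies the two predictive distributions and renormalizes, which is the same as adding their log-probabilities and applying a softmax.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def product_of_experts(main_logits, shallow_logits):
    """Combine two experts: p(y) is proportional to p_main(y) * p_shallow(y)."""
    combined = [
        a + b
        for a, b in zip(log_softmax(main_logits), log_softmax(shallow_logits))
    ]
    return [math.exp(z) for z in log_softmax(combined)]


def softmax(logits):
    return [math.exp(z) for z in log_softmax(logits)]


# Hypothetical logits: the shallow expert is confident in class 1 (e.g. because
# of a trigger-like shortcut), while the main expert prefers class 0.
probs = product_of_experts([2.0, 0.0], [0.0, 3.0])
# An uninformative (uniform) shallow expert leaves the main prediction unchanged.
unchanged = product_of_experts([1.0, 2.0], [0.0, 0.0])
```

In the usual product-of-experts debiasing setup, the loss is computed on the combined distribution during training so that examples already explained by the shallow expert contribute little gradient to the main model, and only the main model is used at inference time.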
The remarkable capabilities of large language models have been accompanied by a persistent drawback: the generation of false and unsubstantiated claims commonly known as "hallucinations". To combat this issue, recent research has introduced approaches that involve editing and attributing the outputs of language models, particularly through prompt-based editing. However, the inference cost and speed of using large language models for editing currently bottleneck prompt-based methods. These bottlenecks motivate the training of compact editors, which is challenging due to the scarcity of training data for this purpose. To overcome these challenges, we exploit the power of large language models to introduce corruptions (i.e., noise) into text and subsequently fine-tune compact editors to denoise the corruptions by incorporating relevant evidence. Our methodology is entirely unsupervised and provides us with faux hallucinations for training in any domain. Our Petite Unsupervised Research and Revision model, PURR, not only improves attribution over existing editing methods based on fine-tuning and prompting, but also achieves faster execution times by orders of magnitude.
https://arxiv.org/abs/2305.14908
Video multimodal fusion aims to integrate multimodal signals in videos, such as visual, audio and text, to make a complementary prediction from the contents of multiple modalities. However, unlike other image-text multimodal tasks, video has longer multimodal sequences with more redundancy and noise in both visual and audio modalities. Prior denoising methods like the forget gate are coarse in the granularity of noise filtering. They often suppress the redundant and noisy information at the risk of losing critical information. Therefore, we propose a denoising bottleneck fusion (DBF) model for fine-grained video multimodal fusion. On the one hand, we employ a bottleneck mechanism to filter out noise and redundancy with a restrained receptive field. On the other hand, we use a mutual information maximization module to regulate the filter-out module to preserve key information within different modalities. Our DBF model achieves significant improvement over current state-of-the-art baselines on multiple benchmarks covering multimodal sentiment analysis and multimodal summarization tasks. This shows that our model can effectively capture salient features from noisy and redundant video, audio, and text inputs. The code for this paper is publicly available online.
https://arxiv.org/abs/2305.14652
Although end-to-end multi-object trackers like MOTR enjoy the merits of simplicity, they suffer seriously from the conflict between detection and association, resulting in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it demands an additional detection network for assistance. In this work, we are the first to reveal that this conflict arises from the unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using the developed release-fetch supervision strategy. In this strategy, labels are first released for detection and gradually fetched back for association. In addition, two further strategies, named pseudo label distillation and track group denoising, are designed to further improve the supervision for detection and association. Without the assistance of an extra detection network during inference, MOTRv3 achieves impressive performance across diverse benchmarks, e.g., MOT17 and DanceTrack.
https://arxiv.org/abs/2305.14298