Self-supervised learning (SSL) based speech pre-training has attracted much attention for its capability of extracting rich representations learned from massive unlabeled data. On the other hand, the use of weakly-supervised data is less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It adopts a similar training procedure to the widely-used masked speech prediction based SSL framework, while incorporating additional target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, allowing potential applications to tasks such as target speech recognition. Our experiments on Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance compared to WavLM, the state-of-the-art SSL model with denoising capability.
https://arxiv.org/abs/2305.16286
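The input construction described above — masked speech prediction plus a per-frame target-speaker enrollment vector — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the span-masking parameters, feature shapes, and function names are assumptions.

```python
import numpy as np

def mask_spans(features, mask_prob=0.065, span=10, rng=None):
    """WavLM-style span masking: sample span starts with probability
    mask_prob per frame, then zero out `span` consecutive frames."""
    rng = rng or np.random.default_rng(0)
    T, _ = features.shape
    mask = np.zeros(T, dtype=bool)
    for t in np.flatnonzero(rng.random(T) < mask_prob):
        mask[t:t + span] = True
    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask

def add_enrollment(features, spk_embedding):
    """Append the target-speaker enrollment vector to every frame --
    the auxiliary input that steers the learned representation toward
    the target speaker."""
    T, _ = features.shape
    return np.concatenate([features, np.tile(spk_embedding, (T, 1))], axis=1)
```

During pre-training, the model would predict the original content at the masked positions, conditioned on the enrollment vector.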
Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Because no existing scene graph dataset offers high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Code and the dataset will be released upon acceptance.
https://arxiv.org/abs/2305.16283
Progress in Automated Handwriting Recognition has been hampered by the lack of large training datasets. Nearly all research uses a set of small datasets that often cause models to overfit. We present CENSUS-HWR, a new dataset consisting of full English handwritten words in 1,812,014 grayscale images. A total of 1,865,134 handwritten texts from a vocabulary of 10,711 English words are present in this collection. This dataset is intended to serve as a benchmark for deep learning handwriting recognition models. This huge English handwriting recognition dataset has been extracted from the US 1930 and 1940 censuses taken by approximately 70,000 enumerators each year. The dataset and the trained model with their weights are freely available to download at this https URL.
https://arxiv.org/abs/2305.16275
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have caught significant attention. By composing a Markovian process that starts in the data domain and then gradually adds noise until reaching pure white noise, they achieve superior performance in learning data distributions. Yet, these models require a large number of diffusion steps to produce aesthetically pleasing samples, which is inefficient. In addition, unlike common generative adversarial networks, the latent space of diffusion models is not interpretable. In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we reduce the latent variable dimension in addition to the traditional noise level addition. As a result, we are able to sample images of size $256\times 256$ with only 7 diffusion steps, almost two orders of magnitude fewer than standard DDPMs require. We formally develop the Markovian diffusion processes of the UDPM, and demonstrate its generation capabilities on the popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable property of UDPM is that it is very easy to interpolate its latent space, which is not the case with standard diffusion models. Our code is available online \url{this https URL}.
https://arxiv.org/abs/2305.16269
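The core forward move of UDPM — corrupting the latent while also shrinking its dimension — can be illustrated with a minimal numpy sketch. Here 2x average pooling stands in for the dimension reduction and plain Gaussian noise for the corruption; the paper's exact degradation operator and noise schedule are not reproduced.

```python
import numpy as np

def udpm_forward_step(x, noise_std, rng=None):
    """One forward step in the UDPM spirit: spatially downsample the latent
    by 2x average pooling *and* add Gaussian noise, so each of the few
    diffusion steps both reduces dimension and increases the noise level."""
    rng = rng or np.random.default_rng(0)
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    down = x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))
    return down + noise_std * rng.standard_normal(down.shape)
```

Applying this step repeatedly is why only a handful of steps reach a tiny, fully noised latent; the generative process learns the time reversal (upsample and denoise).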
A practical text-to-SQL system should generalize well on a wide variety of natural language questions, unseen database schemas, and novel SQL query structures. To comprehensively evaluate text-to-SQL systems, we introduce a \textbf{UNI}fied benchmark for \textbf{T}ext-to-SQL \textbf{E}valuation (UNITE). It is composed of publicly available text-to-SQL datasets, containing natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark \cite{yu-etal-2018-spider}, we introduce $\sim$120K additional examples and a threefold increase in SQL patterns, such as comparative and boolean questions. We conduct a systematic study of six state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that: 1) Codex performs surprisingly well on out-of-domain datasets; 2) specially designed decoding methods (e.g. constrained beam search) can improve performance for both in-domain and out-of-domain settings; 3) explicitly modeling the relationship between questions and schemas further improves the Seq2Seq models. More importantly, our benchmark presents key challenges towards compositional generalization and robustness issues -- which these SOTA models cannot address well.
https://arxiv.org/abs/2305.16265
The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are publicly available at this https URL.
https://arxiv.org/abs/2305.16264
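The finding above — a few epochs of repeated data behave almost like unique data, while further repetition decays toward zero value — suggests an exponentially saturating "effective token" count. The sketch below captures that qualitative shape in the spirit of the paper's scaling law; the functional form and `decay_epochs` constant are assumptions, not the paper's fitted parameterization.

```python
import math

def effective_tokens(unique_tokens, epochs, decay_epochs=15.0):
    """Loss-equivalent token count when a unique corpus is repeated:
    each additional epoch contributes exponentially less, so early epochs
    are worth nearly their face value while many epochs add nothing.
    decay_epochs is an assumed placeholder constant."""
    repeats = epochs - 1  # passes beyond the first
    return unique_tokens * (1 + decay_epochs * (1 - math.exp(-repeats / decay_epochs)))
```

With this shape, effective tokens grow close to linearly for small epoch counts and saturate at `unique_tokens * (1 + decay_epochs)`, matching the observed decay of repeated-data value.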
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. Recent research indicated that these two tasks are inter-dependent and complementary, motivating us to explore a unified modeling method to address them in the context of overlapped speech. A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one, by inserting a Sidecar separator into the frozen well-trained ASR model. Building on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters. The proposed method yields better ASR results compared to the baseline on LibriMix and LibriSpeechMix datasets. Moreover, without sophisticated customization on the diarization task, our method achieves acceptable diarization results on the two-speaker subset of CALLHOME with only a few adaptation steps.
https://arxiv.org/abs/2305.16263
We propose a new class of generative models that naturally handle data of varying dimensionality by jointly modeling the state and dimension of each datapoint. The generative process is formulated as a jump diffusion process that makes jumps between different dimensional spaces. We first define a dimension destroying forward noising process, before deriving the dimension creating time-reversed generative process along with a novel evidence lower bound training objective for learning to approximate it. Simulating our learned approximation to the time-reversed generative process then provides an effective way of sampling data of varying dimensionality by jointly generating state values and dimensions. We demonstrate our approach on molecular and video datasets of varying dimensionality, reporting better compatibility with test-time diffusion guidance imputation tasks and improved interpolation capabilities versus fixed dimensional models that generate state values and dimensions separately.
https://arxiv.org/abs/2305.16261
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) during the past decade. However, the demands of long-document analysis are quite different from those of shorter texts, and the ever-increasing size of documents uploaded online renders NLP on long documents a critical area of research. This paper surveys the current state-of-the-art in the domain, overviewing the relevant neural building blocks and subsequently focusing on two main NLP tasks, Document Classification and Summarization, while also mentioning uses in Sentiment Analysis. We detail the challenges, issues and current solutions related to long-document NLP. We also list publicly available, labelled, long-document datasets used in current research.
https://arxiv.org/abs/2305.16259
This paper studies the online node classification problem under a transductive learning setting. Current methods either invert a graph kernel matrix with $\mathcal{O}(n^3)$ runtime and $\mathcal{O}(n^2)$ space complexity or sample a large volume of random spanning trees, thus are difficult to scale to large graphs. In this work, we propose an improvement based on the \textit{online relaxation} technique introduced by a series of works (Rakhlin et al.,2012; Rakhlin and Sridharan, 2015; 2017). We first prove an effective regret $\mathcal{O}(\sqrt{n^{1+\gamma}})$ when suitable parameterized graph kernels are chosen, then propose an approximate algorithm FastONL enjoying $\mathcal{O}(k\sqrt{n^{1+\gamma}})$ regret based on this relaxation. The key of FastONL is a \textit{generalized local push} method that effectively approximates inverse matrix columns and applies to a series of popular kernels. Furthermore, the per-prediction cost is $\mathcal{O}(\text{vol}({\mathcal{S}})\log 1/\epsilon)$ locally dependent on the graph with linear memory cost. Experiments show that our scalable method enjoys a better tradeoff between local and global consistency.
https://arxiv.org/abs/2305.16257
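The "generalized local push" at the heart of FastONL can be illustrated with its classic special case: forward push (Andersen et al.) for approximating one personalized PageRank column while only touching a local neighborhood of the graph. FastONL generalizes this style of update to inverse columns of parameterized graph kernels; the sketch below is the generic PageRank version, not the paper's algorithm.

```python
from collections import deque

def forward_push(adj, source, alpha=0.15, eps=1e-4):
    """Approximate one personalized PageRank column locally: repeatedly
    move residual mass r into the estimate p, spreading (1 - alpha) of it
    to neighbors, until every residual is below eps * degree."""
    deg = {u: len(vs) for u, vs in adj.items()}
    p, r = {}, {source: 1.0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        ru = r.get(u, 0.0)
        if ru < eps * deg[u]:
            continue  # residual too small to push
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = 0.0
        share = (1.0 - alpha) * ru / deg[u]
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + share
            if before < eps * deg[v] <= r[v]:
                queue.append(v)  # v just crossed the push threshold
    return p
```

The per-node work depends only on the degrees of the touched nodes, which is the source of the locally dependent $\mathcal{O}(\text{vol}(\mathcal{S})\log 1/\epsilon)$-style cost mentioned above.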
Content Warning: This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups. Large pre-trained language models are acknowledged to carry social biases towards different demographics, which can further amplify existing stereotypes in our society and cause even more harm. Text-to-SQL is an important task, models of which are mainly adopted by administrative industries, where unfair decisions may lead to catastrophic consequences. However, existing Text-to-SQL models are trained on clean, neutral datasets, such as Spider and WikiSQL. This, to some extent, covers up social bias in models under ideal conditions, which nevertheless may emerge in real application scenarios. In this work, we aim to uncover and categorize social biases in Text-to-SQL models. We summarize the categories of social biases that may occur in structured data for Text-to-SQL models. We build test benchmarks and reveal that models with similar task accuracy can contain social biases at very different rates. We show how to take advantage of our methodology to uncover and assess social biases in the downstream Text-to-SQL task. We will release our code and data.
https://arxiv.org/abs/2305.16253
Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.
https://arxiv.org/abs/2305.16252
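The abstract names LR ADJUST but does not give its scheduling rule, so the following is a purely hypothetical illustration of the underlying idea: shrink the learning rate as new languages/tasks arrive, so updates still absorb new information without strongly overwriting past knowledge. The decay form, constants, and function name are all assumptions.

```python
def lr_adjust(base_lr, tasks_seen, decay=0.5, floor=1e-5):
    """Hypothetical sketch (not the paper's rule): geometrically reduce
    the learning rate with each new language/task, clipped at a floor so
    new languages can still be learned."""
    return max(base_lr * decay ** tasks_seen, floor)
```

Any such schedule would plug into whichever continual learning approach is in use, consistent with the abstract's claim that the method works across multiple approaches.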
Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
https://arxiv.org/abs/2305.16243
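Okapi BM25 — the surface-level scorer substituted for dense semantic retrieval above — and its use as a cheap re-ranker over a small candidate set can be sketched as follows. Tokenization and the source of candidates (e.g., semantic-retrieval neighbors) are assumptions; the Retro integration itself is not shown.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 scores of each document against the query: rare query
    terms (high IDF) and high but saturating term frequency score well,
    normalized by document length relative to the average."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in set(query_tokens):
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def rerank(query_tokens, candidates_tokens, top_k=2):
    """Re-rank a small candidate set by BM25, the low-overhead scenario
    described above; returns candidate indices, best first."""
    scores = bm25_scores(query_tokens, candidates_tokens)
    return sorted(range(len(candidates_tokens)), key=lambda i: -scores[i])[:top_k]
```

Because re-ranking only scores the handful of candidates already retrieved, it avoids the cost of full-corpus BM25 search while recovering part of the perplexity gain.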
This paper investigates the potential of enhancing Neural Radiance Fields (NeRF) with semantics to expand their applications. Although NeRF has been proven useful in real-world applications like VR and digital creation, the lack of semantics hinders interaction with objects in complex scenes. We propose to imitate the backbone feature of off-the-shelf perception models to achieve zero-shot semantic segmentation with NeRF. Our framework reformulates the segmentation process by directly rendering semantic features and only applying the decoder from perception models. This eliminates the need for expensive backbones and benefits 3D consistency. Furthermore, we can project the learned semantics onto extracted mesh surfaces for real-time interaction. With the state-of-the-art Segment Anything Model (SAM), our framework accelerates segmentation by 16 times with comparable mask quality. The experimental results demonstrate the efficacy and computational advantages of our approach. Project page: \url{https://me.kiui.moe/san/}.
https://arxiv.org/abs/2305.16233
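Rendering semantic features instead of color reuses NeRF's standard volume rendering: alpha compositing per-sample values with accumulated transmittance along each ray. The sketch below shows that compositing step for a feature vector (array shapes and names are illustrative; the framework's networks and the perception decoder are not shown).

```python
import numpy as np

def render_features(sigmas, feats, deltas):
    """Volume-render per-sample features along one ray, exactly as NeRF
    renders color: alpha_i = 1 - exp(-sigma_i * delta_i), weighted by the
    transmittance T_i accumulated from earlier samples. The rendered
    feature is then fed to the perception model's decoder only."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # (S,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # T_i, (S,)
    weights = trans * alphas                                         # (S,)
    return weights @ feats                                           # (D,)
```

Because the same density field weights both color and features, the rendered semantics inherit NeRF's 3D consistency, which is the benefit claimed above.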
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes like material, style, layout, etc. remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image/text-guided material/style/layout transfer/editing, achieving previously unattainable results with a single image input without fine-tuning the diffusion models.
https://arxiv.org/abs/2305.16225
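The per-stage prompt lookup described above — dividing the diffusion trajectory into groups of consecutive steps, each with its own inverted token embedding — can be sketched as a simple step-to-stage mapping. The stage count and the equal-split rule are illustrative assumptions, not ProSpect's exact schedule.

```python
def stage_for_step(t, total_steps, num_stages):
    """Map a diffusion timestep to its generation stage (a group of
    consecutive steps). Sampling typically runs from t = total_steps - 1
    (low-frequency structure) down to t = 0 (high-frequency detail)."""
    assert 0 <= t < total_steps
    steps_per_stage = total_steps / num_stages
    return min(int(t // steps_per_stage), num_stages - 1)

def prompt_for_step(t, total_steps, stage_embeddings):
    """Select the inverted prompt embedding for this step's stage."""
    return stage_embeddings[stage_for_step(t, total_steps, len(stage_embeddings))]
```

Editing a single attribute then amounts to swapping the embedding of the stage where that attribute is generated (e.g., layout at early stages, material at late ones), which is the disentanglement leveraged above.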
Text-to-image (T2I) research has grown explosively in the past year, owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet, one pain point persists: text prompt engineering, where searching for high-quality text prompts for customized results is more art than science. Moreover, as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details, hence necessitating more additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder), substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at this https URL.
https://arxiv.org/abs/2305.16223
Recent advancements in the acquisition of various brain data sources have created new opportunities for integrating multimodal brain data to assist in early detection of complex brain disorders. However, current data integration approaches typically need a complete set of biomedical data modalities, which may not always be feasible, as some modalities are only available in large-scale research cohorts and are prohibitively expensive to collect in routine clinical practice. Especially in studies of brain diseases, research cohorts may include both neuroimaging data and genetic data, but for practical clinical diagnosis, we often need to make disease predictions only based on neuroimages. As a result, it is desirable to design machine learning models which can use all available data (different data could provide complementary information) during training but conduct inference using only the most common data modality. We propose a new incomplete multimodal data integration approach that employs transformers and generative adversarial networks to effectively exploit auxiliary modalities available during training in order to improve the performance of a unimodal model at inference. We apply our new method to predict cognitive degeneration and disease outcomes using the multimodal imaging genetic data from Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Experimental results demonstrate that our approach outperforms the related machine learning and deep learning methods by a significant margin.
https://arxiv.org/abs/2305.16222
Segment anything model (SAM) has presented impressive objectness identification capability with the idea of prompt learning and a new collected large-scale dataset. Given a prompt (e.g., points, bounding boxes, or masks) and an input image, SAM is able to generate valid segment masks for all objects indicated by the prompts, presenting high generalization across diverse scenarios and being a general method for zero-shot transfer to downstream vision tasks. Nevertheless, it remains unclear whether SAM may introduce errors in certain threatening scenarios. Clarifying this is of significant importance for applications that require robustness, such as autonomous vehicles. In this paper, we aim to study the testing-time robustness of SAM under adversarial scenarios and common corruptions. To this end, we first build a testing-time robustness evaluation benchmark for SAM by integrating existing public datasets. Second, we extend representative adversarial attacks against SAM and study the influence of different prompts on robustness. Third, we study the robustness of SAM under diverse corruption types by evaluating SAM on corrupted datasets with different prompts. With experiments conducted on SA-1B and KITTI datasets, we find that SAM exhibits remarkable robustness against various corruptions, except for blur-related corruption. Furthermore, SAM remains susceptible to adversarial attacks, particularly when subjected to PGD and BIM attacks. We think such a comprehensive study could highlight the importance of the robustness issues of SAM and trigger a series of new tasks for SAM as well as downstream vision tasks.
https://arxiv.org/abs/2305.16220
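PGD, the attack family the study finds SAM most susceptible to, is sign-gradient ascent projected onto an L-infinity ball around the clean input. The generic sketch below uses a gradient oracle `grad_fn` standing in for backpropagation through SAM's mask loss; the step sizes are illustrative defaults, not the paper's settings.

```python
import numpy as np

def pgd_attack(x, grad_fn, epsilon=0.03, alpha=0.01, steps=10):
    """Projected Gradient Descent under an L_inf constraint: ascend the
    loss via the sign of its gradient, then project back into the
    epsilon-ball around x and into the valid image range [0, 1]."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # L_inf projection
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # valid image range
    return x_adv
```

BIM, the other attack named above, is the special case of this loop without random initialization; both produce perturbations bounded by epsilon per pixel.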
Semi-supervised medical image segmentation offers a promising solution for large-scale medical image analysis by significantly reducing the annotation burden while achieving comparable performance. Employing this method exhibits a high degree of potential for optimizing the segmentation process and increasing its feasibility in clinical settings during translational investigations. Recently, cross-supervised training based on different co-training sub-networks has become a standard paradigm for this task. Still, the critical issues of sub-network disagreement and label-noise suppression require further attention and progress in cross-supervised training. This paper proposes a cross-supervised learning framework based on dual classifiers (DC-Net), including an evidential classifier and a vanilla classifier. The two classifiers exhibit complementary characteristics, enabling them to handle disagreement effectively and generate more robust and accurate pseudo-labels for unlabeled data. We also incorporate the uncertainty estimation from the evidential classifier into cross-supervised training to alleviate the negative effect of the error supervision signal. The extensive experiments on LA and Pancreas-CT dataset illustrate that DC-Net outperforms other state-of-the-art methods for semi-supervised segmentation. The code will be released soon.
https://arxiv.org/abs/2305.16216
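The evidential classifier's uncertainty and the uncertainty-gated cross supervision can be sketched as follows. The subjective-logic uncertainty u = K/S with Dirichlet strength S = sum(evidence + 1) is the standard evidential-classifier quantity; the gating threshold and argmax pseudo-label scheme here are illustrative assumptions, not DC-Net's exact losses.

```python
import numpy as np

def evidential_uncertainty(evidence):
    """Subjective-logic uncertainty: with Dirichlet parameters
    alpha = evidence + 1 and strength S = sum(alpha), u = K / S is 1.0
    with no evidence and shrinks as evidence accumulates."""
    alpha = np.asarray(evidence, dtype=float) + 1.0
    return alpha.shape[-1] / alpha.sum(axis=-1)

def cross_pseudo_labels(probs_vanilla, evidence, max_uncertainty=0.5):
    """Sketch of cross supervision between the two branches: each
    classifier takes the other's argmax as pseudo-label, and samples where
    the evidential branch is too uncertain are masked out to suppress
    noisy supervision signals."""
    probs_vanilla = np.asarray(probs_vanilla, dtype=float)
    alpha = np.asarray(evidence, dtype=float) + 1.0
    probs_evidential = alpha / alpha.sum(axis=-1, keepdims=True)
    labels_for_evidential = probs_vanilla.argmax(axis=-1)
    labels_for_vanilla = probs_evidential.argmax(axis=-1)
    keep = evidential_uncertainty(evidence) < max_uncertainty
    return labels_for_vanilla, labels_for_evidential, keep
```

The complementary behavior claimed above comes from the two heads disagreeing in different ways: the evidential head additionally reports how much evidence backs its prediction, which drives the masking.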
Consistency learning plays a crucial role in semi-supervised medical image segmentation as it enables the effective utilization of limited annotated data while leveraging the abundance of unannotated data. The effectiveness and efficiency of consistency learning are challenged by prediction diversity and training stability, which are often overlooked by existing studies. Meanwhile, the limited quantity of labeled data for training often proves inadequate for formulating intra-class compactness and inter-class discrepancy of pseudo labels. To address these issues, we propose a self-aware and cross-sample prototypical learning method (SCP-Net) to enhance the diversity of prediction in consistency learning by utilizing a broader range of semantic information derived from multiple inputs. Furthermore, we introduce a self-aware consistency learning method that exploits unlabeled data to improve the compactness of pseudo labels within each class. Moreover, a dual loss re-weighting method is integrated into the cross-sample prototypical consistency learning method to improve the reliability and stability of our model. Extensive experiments on ACDC dataset and PROMISE12 dataset validate that SCP-Net outperforms other state-of-the-art semi-supervised segmentation methods and achieves significant performance gains compared to the limited supervised training. Our code will be released soon.
https://arxiv.org/abs/2305.16214
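The prototypical core of the method above — turning predictions on multiple inputs into class prototypes and classifying features by prototype similarity — can be sketched with the basic construction. SCP-Net's self-aware and cross-sample weighting schemes and its dual loss re-weighting are not reproduced; shapes and the temperature are illustrative assumptions.

```python
import numpy as np

def class_prototypes(features, probs):
    """Probability-weighted class prototypes over per-pixel features:
    features is (N, D), probs is (N, K); returns (K, D) prototypes."""
    weights = probs / (probs.sum(axis=0, keepdims=True) + 1e-8)  # (N, K)
    return weights.T @ features                                  # (K, D)

def prototype_predictions(features, prototypes, tau=0.1):
    """Soft predictions from temperature-scaled cosine similarity of each
    feature to each class prototype (softmax over classes)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    c = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = f @ c.T / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Consistency between a branch's own predictions and these prototype-based predictions is one way such methods enforce intra-class compactness on unlabeled data.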