Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance across domains makes cross-domain performance drop severely, preventing the deployment of gaze estimation in real-world applications. Among all the factors, the ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, while collecting data that covers large ranges is expensive. This work proposes an effective model training pipeline consisting of training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages single-image 3D reconstruction to expand the range of head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce a background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, we show that a model using only monocular-reconstructed synthetic training data can perform comparably to one trained on real data with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at this https URL.
https://arxiv.org/abs/2305.16140
Due to the unsupervised nature of anomaly detection, the key to fueling deep models is finding supervisory signals. Different from current reconstruction-guided generative models and transformation-based contrastive models, we devise novel data-driven supervision for tabular data by introducing a characteristic -- scale -- as data labels. By representing varied sub-vectors of data instances, we define scale as the relationship between the dimensionality of original sub-vectors and that of representations. Scales serve as labels attached to transformed representations, thus offering ample labeled data for neural network training. This paper further proposes a scale learning-based anomaly detection method. Supervised by the learning objective of scale distribution alignment, our approach learns the ranking of representations converted from varied subspaces of each data instance. Through this proxy task, our approach models inherent regularities and patterns within data, which well describes data "normality". Abnormal degrees of testing instances are obtained by measuring whether they fit these learned patterns. Extensive experiments show that our approach leads to significant improvement over state-of-the-art generative/contrastive anomaly detection methods.
https://arxiv.org/abs/2305.16114
Learning quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in the development of transformer-based models that produce sentence embeddings with self-contrastive learning, the encoding of long documents (thousands of words) is still challenging with respect to both efficiency and quality considerations. Therefore, we train Longformer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). Furthermore, we complement the baseline method -- a siamese neural network -- with additional convex neural networks based on functional Bregman divergence, aiming to enhance the quality of the output document representations. We show that, overall, the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long document topic classification tasks from the legal and biomedical domains.
https://arxiv.org/abs/2305.16031
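For readers unfamiliar with the term, a Bregman divergence can be sketched as follows. This is a minimal illustration of the classical vector form for a convex generator, not the functional Bregman divergence or the neural parameterization used in the paper; the function names `bregman_divergence`, `sq_norm`, and `neg_entropy` are hypothetical.

```python
import math


def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y> for convex phi."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


# Generator 1: squared Euclidean norm, which yields squared Euclidean distance.
def sq_norm(v):
    return sum(vi * vi for vi in v)


def sq_norm_grad(v):
    return [2.0 * vi for vi in v]


# Generator 2: negative entropy, which yields the KL divergence
# when both arguments are probability vectors with positive entries.
def neg_entropy(p):
    return sum(pi * math.log(pi) for pi in p)


def neg_entropy_grad(p):
    return [math.log(pi) + 1.0 for pi in p]


if __name__ == "__main__":
    print(bregman_divergence(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, -1.0]))
    print(bregman_divergence(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75]))
```

Different convex generators recover different familiar dissimilarities, which is why a learned convex network can in principle act as a learned divergence.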
We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than any existing approach.
https://arxiv.org/abs/2305.15937
Unsupervised commonsense reasoning (UCR) is becoming increasingly popular as the construction of commonsense reasoning datasets is expensive, and they are inevitably limited in their scope. A popular approach to UCR is to fine-tune language models with external knowledge (e.g., knowledge graphs), but this usually requires a large number of training examples. In this paper, we propose to transform the downstream multiple choice question answering task into a simpler binary classification task by ranking all candidate answers according to their reasonableness. To this end, for training the model, we convert the knowledge graph triples into reasonable and unreasonable texts. Extensive experimental results show the effectiveness of our approach on various multiple choice question answering benchmarks. Furthermore, compared with existing UCR approaches using KGs, ours is less data hungry. Our code is available at this https URL.
https://arxiv.org/abs/2305.15932
Low-dose computed tomography (CT) image denoising is crucial in medical image computing. Recent years have seen remarkable improvement in deep learning-based methods for this task. However, training deep denoising neural networks requires low-dose and normal-dose CT image pairs, which are difficult to obtain in clinical settings. To address this challenge, we propose a novel fully unsupervised method for low-dose CT image denoising, which is based on a denoising diffusion probabilistic model -- a powerful generative model. First, we train an unconditional denoising diffusion probabilistic model capable of generating high-quality normal-dose CT images from random noise. Subsequently, the probabilistic priors of the pre-trained diffusion model are incorporated into a Maximum A Posteriori (MAP) estimation framework for iteratively solving the image denoising problem. Our method ensures the diffusion model produces high-quality normal-dose CT images while keeping the image content consistent with the input low-dose CT images. We evaluate our method on a widely used low-dose CT image denoising benchmark, and it outperforms several supervised low-dose CT image denoising methods in terms of both quantitative and visual performance.
https://arxiv.org/abs/2305.15887
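The MAP framing mentioned in this abstract can be written generically as below. This is a sketch under stated assumptions rather than the paper's exact objective: $y$ denotes the low-dose input, $x$ the normal-dose image, $p(x)$ the prior supplied by the pre-trained diffusion model, and the Gaussian data term with variance $\sigma^2$ is an assumption made here for illustration (real low-dose CT noise is more complex).

```latex
% Generic MAP estimate: posterior maximization via Bayes' rule.
\hat{x} = \arg\max_{x} \, p(x \mid y)
        = \arg\max_{x} \, \bigl[ \log p(y \mid x) + \log p(x) \bigr]

% Assuming y = x + n with n ~ N(0, \sigma^2 I), the data term becomes quadratic:
\hat{x} = \arg\min_{x} \, \frac{1}{2\sigma^{2}} \lVert y - x \rVert_2^{2} - \log p(x)
```

The first term keeps the estimate consistent with the low-dose input, while the second favors images that the prior considers plausible normal-dose CT.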
Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly-technical fields. Dense representations are used as features for downstream components and have multiple applications ranging from ranking results in search to summarization. Common approaches to create dense representations include training domain-specific embeddings with self-supervised setups or using sentence encoder models trained over similarity tasks. In contrast to static embeddings, sentence encoders do not suffer from the out-of-vocabulary (OOV) problem, but impose significant computational costs. In this paper, we propose a fully unsupervised approach to text encoding that consists of training small character-based models with the objective of reconstructing large pre-trained embedding matrices. Models trained with this approach can not only match the quality of sentence encoders in technical domains, but are 5 times smaller and up to 10 times faster, even on high-end GPUs.
https://arxiv.org/abs/2305.15867
Image fusion plays a key role in a variety of multi-sensor-based vision systems, especially for enhancing visual quality and/or extracting aggregated features for perception. However, most existing methods just consider image fusion as an individual task, thus ignoring its underlying relationship with these downstream vision problems. Furthermore, designing proper fusion architectures often requires huge engineering labor. It also lacks mechanisms to improve the flexibility and generalization ability of current fusion approaches. To mitigate these issues, we establish a Task-guided, Implicit-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario. Specifically, we first propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion. Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency. In addition, a pretext meta initialization technique is introduced to leverage divergence fusion data to support fast adaptation for different kinds of image fusion tasks. Qualitative and quantitative experimental results on different categories of image fusion problems and related downstream tasks (e.g., visual enhancement and semantic understanding) substantiate the flexibility and effectiveness of our TIM. The source code will be available at this https URL.
https://arxiv.org/abs/2305.15862
This paper proposes an unsupervised anomalous sound detection method using sound separation. In factory environments, background noise and non-target sounds obscure the desired machine sounds, making it challenging to detect anomalous sounds. Therefore, it is desirable for the detection system to use sounds that are not mixed with background noise or non-target sounds. We compared two versions of our proposed method, one using sound separation as a pre-processing step and the other using separation-based outlier exposure that uses the error between two separated sounds. Based on the assumption that differences in separation performance between normal and anomalous sounds affect detection results, a sound separation model specific to a particular product type was used in both versions. Experimental results indicate that the proposed method improved anomalous sound detection performance for all Machine IDs, achieving a maximum improvement of 39%.
https://arxiv.org/abs/2305.15859
Recent years have seen increasing concerns about the unsafe response generation of large-scale dialogue systems, where agents will learn offensive or biased behaviors from the real-world corpus. Some methods are proposed to address the above issue by detecting and replacing unsafe training examples in a pipeline style. Though effective, they suffer from a high annotation cost and adapt poorly to unseen scenarios as well as adversarial attacks. Besides, the neglect of providing safe responses (e.g. simply replacing with templates) will cause the information-missing problem of dialogues. To address these issues, we propose an unsupervised pseudo-label sampling method, TEMP, that can automatically assign potential safe responses. Specifically, our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and distribute in the tail. Extensive experiments in chitchat and task-oriented dialogues show that our TEMP outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
https://arxiv.org/abs/2305.15757
The term "Code Mixed" refers to the use of more than one language in the same text. This phenomenon is predominantly observed on social media platforms, with increasing adoption as time goes on. It is critical to detect foreign elements in a language and process them correctly, as a considerable number of individuals use code-mixed languages that cannot be comprehended by understanding only one of those languages. In this work, we focus on the low-resource Hindi-English code-mixed language and on enhancing the performance of different code-mixed natural language processing tasks such as sentiment analysis, emotion recognition, and hate speech identification. We perform a comparative analysis of different Transformer-based language models pre-trained using unsupervised approaches. We have included code-mixed models like HingBERT, HingRoBERTa, HingRoBERTa-Mixed, mBERT, and non-code-mixed models like AlBERT, BERT, and RoBERTa for comparative analysis of code-mixed Hindi-English downstream tasks. We report state-of-the-art results on the respective datasets using HingBERT-based models, which are specifically pre-trained on real code-mixed text. Our HingBERT-based models provide significant improvements, thus highlighting the poor performance of vanilla BERT models on code-mixed text.
https://arxiv.org/abs/2305.15722
Although existing image anomaly detection methods yield impressive results, they are mostly an offline learning paradigm that requires excessive data pre-collection, limiting their adaptability in industrial scenarios with online streaming data. Online learning-based image anomaly detection methods are more compatible with industrial online streaming data but are rarely noticed. For the first time, this paper presents a fully online learning image anomaly detection method, namely LeMO, learning memory for online image anomaly detection. LeMO leverages learnable memory initialized with orthogonal random noise, eliminating the need for excessive data in memory initialization and circumventing the inefficiencies of offline data collection. Moreover, a contrastive learning-based loss function for anomaly detection is designed to enable online joint optimization of memory and image target-oriented features. The presented method is simple and highly effective. Extensive experiments demonstrate the superior performance of LeMO in the online setting. Additionally, in the offline setting, LeMO is also competitive with the current state-of-the-art methods and achieves excellent performance in few-shot scenarios.
https://arxiv.org/abs/2305.15652
Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences -- locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly or unsupervised method on PF-Willow, CUB-200 and SPair-71k datasets.
https://arxiv.org/abs/2305.15581
We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered, discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training a unit-based encoder-decoder language model with a denoising objective, (b) training it with word-by-word translated utterance pairs created by aligning monolingual text embedding spaces, and (c) running unsupervised backtranslation bootstrapping off of the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text-to-speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU in German to English and 8.07 in English to German.
https://arxiv.org/abs/2305.15405
Multi-party dialogues are more difficult for models to understand than one-to-one two-party dialogues, since they involve multiple interlocutors, resulting in interweaving reply-to relations and information flows. To step over these obstacles, an effective way is to pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying. However, due to the lack of explicitly annotated discourse labels in multi-party dialogue corpora, previous works fail to scale up the pre-training process and leave the unlabeled multi-party conversational data unused. To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model by unsupervised latent variable inference methods. Experiments on multiple downstream tasks show that our pre-trained model outperforms strong baselines by large margins and achieves state-of-the-art (SOTA) results, justifying the effectiveness of our method. The official implementation of this paper is available at this https URL.
https://arxiv.org/abs/2305.15175
Contrastive learning has been the dominant approach to train state-of-the-art sentence embeddings. Previous studies have typically learned sentence embeddings either through the use of human-annotated natural language inference (NLI) data or via large-scale unlabeled sentences in an unsupervised manner. However, even in the case of unlabeled data, their acquisition presents challenges in certain domains due to various reasons. To address these issues, we present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data. Specifically, we explore utilizing large language models to synthesize the required data samples for contrastive learning, including (1) producing positive and negative annotations given unlabeled sentences (SynCSE-partial), and (2) generating sentences along with their corresponding annotations from scratch (SynCSE-scratch). Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, and SynCSE-partial even achieves comparable performance to the supervised models in most settings.
https://arxiv.org/abs/2305.15077
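The contrastive objective typically used for sentence embeddings can be sketched as an in-batch InfoNCE loss. This is a minimal illustration assuming cosine similarity and a temperature of 0.05; the function names `cosine` and `info_nce` are hypothetical, the toy vectors stand in for encoder outputs, and nothing here reproduces how SynCSE synthesizes its positives and negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    positives[i] is the positive for anchors[i]; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]]))  # well-aligned pairs
    print(info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

The loss is small when each anchor is closest to its own positive and grows when another item in the batch is closer.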
Current state-of-the-art object-centric models use slots and attention-based routing for binding. However, this class of models has several conceptual limitations: the number of slots is hardwired; all slots have equal capacity; training has high computational cost; there are no object-level relational factors within slots. Synchrony-based models in principle can address these limitations by using complex-valued activations which store binding information in their phase components. However, working examples of such synchrony-based models have been developed only very recently, and are still limited to toy grayscale datasets and simultaneous storage of fewer than three objects in practice. Here we introduce architectural modifications and a novel contrastive learning method that greatly improve the state-of-the-art synchrony-based model. For the first time, we obtain a class of synchrony-based models capable of discovering objects in an unsupervised manner in multi-object color datasets and simultaneously representing more than three objects.
https://arxiv.org/abs/2305.15001
A trustworthy real-world prediction system should be well-calibrated; that is, its confidence in an answer is indicative of the likelihood that the answer is correct, enabling deferral to a more expensive expert in cases of low-confidence predictions. While recent studies have shown that unsupervised pre-training produces large language models (LMs) that are remarkably well-calibrated, the most widely-used LMs in practice are fine-tuned with reinforcement learning with human feedback (RLHF-LMs) after the initial unsupervised pre-training stage, and results are mixed as to whether these models preserve the well-calibratedness of their ancestors. In this paper, we conduct a broad evaluation of computationally feasible methods for extracting confidence scores from LLMs fine-tuned with RLHF. We find that with the right prompting strategy, RLHF-LMs verbalize probabilities that are much better calibrated than the model's conditional probabilities, enabling fairly well-calibrated predictions. Through a combination of prompting strategy and temperature scaling, we find that we can reduce the expected calibration error of RLHF-LMs by over 50%.
https://arxiv.org/abs/2305.14975
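The two quantities this abstract relies on, expected calibration error (ECE) and temperature scaling, can be sketched as follows. This is a minimal illustration assuming equal-width confidence bins; the function names and the default of 10 bins are choices made here, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four predictions made at 90% confidence, three of which are correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
    # Raising the temperature lowers the top-class confidence.
    print(softmax([2.0, 0.0], temperature=1.0)[0])
    print(softmax([2.0, 0.0], temperature=2.0)[0])
```

A model that is right 75% of the time while reporting 90% confidence is overconfident; a temperature above 1 pulls the reported confidence toward the observed accuracy.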
The remarkable capabilities of large language models have been accompanied by a persistent drawback: the generation of false and unsubstantiated claims commonly known as "hallucinations". To combat this issue, recent research has introduced approaches that involve editing and attributing the outputs of language models, particularly through prompt-based editing. However, the inference cost and speed of using large language models for editing currently bottleneck prompt-based methods. These bottlenecks motivate the training of compact editors, which is challenging due to the scarcity of training data for this purpose. To overcome these challenges, we exploit the power of large language models to introduce corruptions (i.e., noise) into text and subsequently fine-tune compact editors to denoise the corruptions by incorporating relevant evidence. Our methodology is entirely unsupervised and provides us with faux hallucinations for training in any domain. Our Petite Unsupervised Research and Revision model, PURR, not only improves attribution over existing editing methods based on fine-tuning and prompting, but also achieves faster execution times by orders of magnitude.
https://arxiv.org/abs/2305.14908
Anomaly detection is commonly framed as an unsupervised learning task that identifies images deviating from normal images. In general, there are two main challenges in anomaly detection tasks, i.e., the class imbalance and the unexpectedness of anomalies. In this paper, we propose a multiresolution feature guidance method based on Transformer, named GTrans, for unsupervised anomaly detection and localization. In GTrans, an Anomaly Guided Network (AGN) pre-trained on ImageNet is developed to provide surrogate labels for features and tokens. Under the tacit knowledge guidance of the AGN, the anomaly detection network named Trans utilizes Transformer to effectively establish a relationship between multiresolution features, enhancing the ability of Trans to fit the normal data manifold. Due to the strong generalization ability of AGN, GTrans locates anomalies by comparing the differences in spatial distance and direction of multi-scale features extracted from the AGN and the Trans. Our experiments demonstrate that the proposed GTrans achieves state-of-the-art performance in both detection and localization on the MVTec AD dataset. GTrans achieves image-level and pixel-level anomaly detection AUROC scores of 99.0% and 97.9% on the MVTec AD dataset, respectively.
https://arxiv.org/abs/2305.14880
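The AUROC metric reported above can be sketched as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. This is a minimal pairwise implementation for illustration; the function name `auroc` and the toy scores are hypothetical, and the quadratic loop is only meant for small inputs.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels use 1 for anomalous and 0 for normal; higher scores
    should indicate anomalies. Ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Three of the four anomalous/normal pairs are ranked correctly.
    print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Image-level AUROC applies this to one score per image, while pixel-level AUROC applies the same computation to per-pixel anomaly scores against a ground-truth mask.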