Self-supervised learning (SSL) is a powerful tool in machine learning, but understanding the learned representations and their underlying mechanisms remains a challenge. This paper presents an in-depth empirical analysis of SSL-trained representations, encompassing diverse models, architectures, and hyperparameters. Our study reveals an intriguing aspect of the SSL training process: it inherently facilitates the clustering of samples with respect to semantic labels, which is surprisingly driven by the SSL objective's regularization term. This clustering process not only enhances downstream classification but also compresses the data information. Furthermore, we establish that SSL-trained representations align more closely with semantic classes rather than random classes. Remarkably, we show that learned representations align with semantic classes across various hierarchical levels, and this alignment increases during training and when moving deeper into the network. Our findings provide valuable insights into SSL's representation learning mechanisms and their impact on performance across different sets of classes.
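As a rough illustration of how such clustering can be probed on any frozen SSL encoder, the sketch below (our own minimal example, not the paper's code; `embeddings` and `labels` are placeholders for real encoder outputs and semantic labels) runs k-means on the representations and measures agreement with the labels via normalized mutual information.

```python
# Minimal sketch: quantify how well SSL embeddings cluster w.r.t. semantic labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_alignment(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """NMI between k-means cluster assignments and ground-truth labels."""
    n_classes = len(np.unique(labels))
    assignments = KMeans(n_clusters=n_classes, n_init=10).fit_predict(embeddings)
    return normalized_mutual_info_score(labels, assignments)

# Random data standing in for real SSL features and labels.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128)).astype(np.float32)
lab = rng.integers(0, 10, size=1000)
print(f"NMI: {cluster_alignment(emb, lab):.3f}")
```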
https://arxiv.org/abs/2305.15614
Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.
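The fusion-and-match recipe is simple enough to sketch. The snippet below is a hedged illustration under assumed feature shapes, not the paper's implementation: per-pixel features from the two backbones are L2-normalized, concatenated with a weighting factor `alpha` (a hypothetical parameter), and correspondences are read off as cosine-similarity nearest neighbours.

```python
# Sketch of fusing two feature maps and matching pixels by nearest neighbour.
import torch
import torch.nn.functional as F

def fuse(feat_a: torch.Tensor, feat_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """feat_*: (C_i, H, W) maps from two backbones; returns (C_a + C_b, H, W)."""
    a = F.normalize(feat_a.flatten(1), dim=0).reshape(feat_a.shape)
    b = F.normalize(feat_b.flatten(1), dim=0).reshape(feat_b.shape)
    return torch.cat([alpha * a, (1 - alpha) * b], dim=0)

def nn_match(src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """For each source pixel, the index of the most similar target pixel."""
    s = F.normalize(src.flatten(1), dim=0)   # (C, Hs*Ws)
    t = F.normalize(tgt.flatten(1), dim=0)   # (C, Ht*Wt)
    sim = s.t() @ t                          # (Hs*Ws, Ht*Wt) cosine similarities
    return sim.argmax(dim=1)

# Toy usage with random tensors standing in for SD / DINOv2 features.
sd_src, dino_src = torch.randn(1280, 32, 32), torch.randn(768, 32, 32)
sd_tgt, dino_tgt = torch.randn(1280, 32, 32), torch.randn(768, 32, 32)
matches = nn_match(fuse(sd_src, dino_src), fuse(sd_tgt, dino_tgt))
print(matches.shape)  # torch.Size([1024])
```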
https://arxiv.org/abs/2305.15347
Tabular representation learning has recently gained a lot of attention. However, existing approaches only learn a representation from a single table, and thus ignore the potential to learn from the full structure of relational databases, including neighboring tables that can contain important information for a contextualized representation. Moreover, current models are significantly limited in scale, which prevents them from learning from large databases. In this paper, we thus introduce our vision of relational representation learning, which can not only learn from the full relational structure, but can also scale to the larger database sizes commonly found in the real world. Moreover, we discuss the opportunities and challenges we see along the way to enable this vision and present initial, very promising results. Overall, we argue that this direction can lead to foundation models for relational databases, which today exist only for text and images.
https://arxiv.org/abs/2305.15321
In this study, we propose Feature-aligned N-BEATS as a domain generalization model for univariate time series forecasting problems. The proposed model is an extension of the doubly residual stacking architecture of N-BEATS (Oreshkin et al. [34]) into a representation learning framework. The model is a new structure that involves marginal feature probability measures (i.e., pushforward measures of multiple source domains) induced by the intricate composition of residual operators of N-BEATS in each stack and aligns them stack-wise via an entropic regularized Wasserstein distance referred to as the Sinkhorn divergence (Genevay et al. [14]). The loss function consists of a typical forecasting loss for multiple source domains and an alignment loss calculated with the Sinkhorn divergence, which allows the model to learn invariant features stack-wise across multiple source data sequences while retaining N-BEATS's interpretable design. We conduct a comprehensive experimental evaluation of the proposed approach and the results demonstrate the model's forecasting and generalization capabilities in comparison with methods based on the original N-BEATS.
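A minimal sketch of the alignment term, assuming plain uniform empirical measures over per-stack features from two source domains (our own log-domain Sinkhorn implementation, not the authors' code; `eps`, `iters`, and `lambda_align` are illustrative):

```python
# Sketch: debiased Sinkhorn divergence used as a stack-wise alignment loss.
import math
import torch

def sinkhorn_cost(x: torch.Tensor, y: torch.Tensor, eps: float = 0.05, iters: int = 100) -> torch.Tensor:
    """Entropic OT cost between uniform empirical measures on x (n, d) and y (m, d)."""
    c = torch.cdist(x, y, p=2) ** 2
    log_a = torch.full((x.shape[0],), -math.log(x.shape[0]))
    log_b = torch.full((y.shape[0],), -math.log(y.shape[0]))
    f = torch.zeros(x.shape[0])
    g = torch.zeros(y.shape[0])
    for _ in range(iters):  # log-domain Sinkhorn updates of the dual potentials
        f = -eps * torch.logsumexp(log_b[None, :] + (g[None, :] - c) / eps, dim=1)
        g = -eps * torch.logsumexp(log_a[:, None] + (f[:, None] - c) / eps, dim=0)
    pi = torch.exp(log_a[:, None] + log_b[None, :] + (f[:, None] + g[None, :] - c) / eps)
    return (pi * c).sum()

def sinkhorn_divergence(x: torch.Tensor, y: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    """Debiased form serving as the alignment term between two feature measures."""
    return sinkhorn_cost(x, y, eps) - 0.5 * sinkhorn_cost(x, x, eps) - 0.5 * sinkhorn_cost(y, y, eps)

# Toy usage: features of two source domains taken at the same stack.
feats_dom1, feats_dom2 = torch.randn(64, 16), torch.randn(64, 16) + 1.0
align_loss = sinkhorn_divergence(feats_dom1, feats_dom2)
# total_loss = forecasting_loss + lambda_align * align_loss   (lambda_align: loss weight)
```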
https://arxiv.org/abs/2305.15196
Unsupervised learning has grown in popularity because of the difficulty of collecting annotated data and the development of modern frameworks that allow us to learn from unlabeled data. Existing studies, however, either disregard variations at different levels of similarity or only consider negative samples from one batch. We argue that image pairs should have varying degrees of similarity, and the negative samples should be allowed to be drawn from the entire dataset. In this work, we propose Search-based Unsupervised Visual Representation Learning (SUVR) to learn better image representations in an unsupervised manner. We first construct a graph from the image dataset by the similarity between images, and adopt the concept of graph traversal to explore positive samples. In the meantime, we make sure that negative samples can be drawn from the full dataset. Quantitative experiments on five benchmark image classification datasets demonstrate that SUVR can significantly outperform strong competing methods on unsupervised embedding learning. Qualitative experiments also show that SUVR can produce better representations in which similar images are clustered closer together than unrelated images in the latent space.
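The graph-traversal idea can be sketched in a few lines. The following is an illustrative toy version under our own assumptions (cosine kNN graph, two-hop positives), not the SUVR code; negatives are drawn from the full dataset rather than the current batch.

```python
# Sketch: similarity graph, graph-traversal positives, dataset-wide negatives.
import numpy as np

def build_knn_graph(embeddings: np.ndarray, k: int = 5) -> list[list[int]]:
    """Adjacency list: each image is linked to its k most similar images (cosine)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    return [list(np.argsort(-sim[i])[:k]) for i in range(len(x))]

def traverse_positives(graph: list[list[int]], anchor: int, hops: int = 2) -> set[int]:
    """Breadth-first traversal: nodes within `hops` of the anchor act as positives."""
    frontier, visited = {anchor}, {anchor}
    for _ in range(hops):
        frontier = {n for node in frontier for n in graph[node]} - visited
        visited |= frontier
    return visited - {anchor}

def sample_negatives(n_total: int, exclude: set[int], n_neg: int, rng) -> np.ndarray:
    """Negatives may come from anywhere in the dataset, not just the batch."""
    pool = np.setdiff1d(np.arange(n_total), np.fromiter(exclude, dtype=int))
    return rng.choice(pool, size=n_neg, replace=False)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
graph = build_knn_graph(emb)
pos = traverse_positives(graph, anchor=0)
neg = sample_negatives(200, pos | {0}, n_neg=16, rng=rng)
```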
https://arxiv.org/abs/2305.14754
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulty preserving subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control and consumes subject images and text prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide the subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce a visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generate new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation and efficient fine-tuning for customized subjects with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at this https URL. Project page at this https URL.
https://arxiv.org/abs/2305.14720
Traditional sentence embedding models encode sentences into vector representations to capture useful properties such as the semantic similarity between sentences. However, in addition to similarity, sentence semantics can also be interpreted via compositional operations such as sentence fusion or difference. It is unclear whether the compositional semantics of sentences can be directly reflected as compositional operations in the embedding space. To more effectively bridge the continuous embedding and discrete text spaces, we explore the plausibility of incorporating various compositional properties into the sentence embedding space that allows us to interpret embedding transformations as compositional sentence operations. We propose InterSent, an end-to-end framework for learning interpretable sentence embeddings that supports compositional sentence operations in the embedding space. Our method optimizes operator networks and a bottleneck encoder-decoder model to produce meaningful and interpretable sentence embeddings. Experimental results demonstrate that our method significantly improves the interpretability of sentence embeddings on four textual generation tasks over existing approaches while maintaining strong performance on traditional semantic similarity tasks.
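As a hypothetical sketch of what an embedding-space operator might look like (dimensions and architecture are made up, not the InterSent design), a small operator network maps a pair of sentence embeddings to the embedding of their fusion, which a bottleneck decoder would then turn back into text.

```python
# Sketch: a fusion operator acting directly on sentence embeddings.
import torch
import torch.nn as nn

class FusionOperator(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """Embeddings of two sentences in, embedding of their fusion out."""
        return self.net(torch.cat([u, v], dim=-1))

op = FusionOperator()
u, v = torch.randn(4, 768), torch.randn(4, 768)   # embeddings of two sentence batches
fused = op(u, v)                                  # (4, 768), fed to a decoder in training
target = torch.randn(4, 768)                      # embedding of the reference fused sentence
loss = nn.functional.mse_loss(fused, target)      # one possible training signal
```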
https://arxiv.org/abs/2305.14599
We introduce Point2SSM, a novel unsupervised learning approach that can accurately construct correspondence-based statistical shape models (SSMs) of anatomy directly from point clouds. SSMs are crucial in clinical research for analyzing the population-level morphological variation in bones and organs. However, traditional methods for creating SSMs have limitations that hinder their widespread adoption, such as the need for noise-free surface meshes or binary volumes, reliance on assumptions or predefined templates, and simultaneous optimization of the entire cohort leading to lengthy inference times given new data. Point2SSM overcomes these barriers by providing a data-driven solution that infers SSMs directly from raw point clouds, reducing inference burdens and increasing applicability as point clouds are more easily acquired. Deep learning on 3D point clouds has seen recent success in unsupervised representation learning, point-to-point matching, and shape correspondence; however, their application to constructing SSMs of anatomies is largely unexplored. In this work, we benchmark state-of-the-art point cloud deep networks on the task of SSM and demonstrate that they are not robust to the challenges of anatomical SSM, such as noisy, sparse, or incomplete input and significantly limited training data. Point2SSM addresses these challenges via an attention-based module that provides correspondence mappings from learned point features. We demonstrate that the proposed method significantly outperforms existing networks in terms of both accurate surface sampling and correspondence, better capturing population-level statistics.
https://arxiv.org/abs/2305.14486
Learning structured representations of the visual world in terms of objects promises to significantly improve the generalization abilities of current machine learning models. While recent efforts to this end have shown promising empirical progress, a theoretical account of when unsupervised object-centric representation learning is possible is still lacking. Consequently, understanding the reasons for the success of existing object-centric methods as well as designing new theoretically grounded methods remains challenging. In the present work, we analyze when object-centric representations can provably be learned without supervision. To this end, we first introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility. Under this generative process, we prove that the ground-truth object representations can be identified by an invertible and compositional inference model, even in the presence of dependencies between objects. We empirically validate our results through experiments on synthetic data. Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models' compositionality and invertibility and their empirical identifiability.
https://arxiv.org/abs/2305.14229
The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is still challenging for video models to reach this goal due to the increasing uncertainty of spatiotemporal signals. To ease training, existing works leverage the prior knowledge of image foundation models and equip them with efficient temporal modules. Despite satisfactory fine-tuning performance, we empirically find that they fall short of out-of-the-box usage, given their degraded performance under zero-shot/linear protocols compared to their baseline counterparts. In this work, we analyze the factor that leads to this degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal, since it may overfit to particular styles and thereby lose its original generalization ability to capture the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal that degrades the video representation. To tackle this issue, we propose a degradation-free pre-training strategy that retains the generalization ability of the text encoder by freezing its shallow layers while letting the tunable deep layers capture task-related semantics. As for the training objective, we adopt the transcript sorting task from TVTS, combined with masking techniques, to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-art results on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at this https URL.
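The layer-freezing part of the recipe is easy to illustrate. The sketch below assumes a CLIP-style text tower exposing an `embeddings` module and a `layers` list (attribute names are assumptions, and the toy encoder only stands in for a real one): the token embedding and shallow blocks are frozen to preserve general language semantics, while deep blocks remain tunable.

```python
# Sketch: freeze embeddings and the shallow transformer blocks of a text encoder.
import torch.nn as nn

def freeze_shallow_layers(text_encoder: nn.Module, n_frozen: int) -> None:
    for p in text_encoder.embeddings.parameters():      # assumed attribute name
        p.requires_grad = False
    for i, block in enumerate(text_encoder.layers):     # assumed attribute name
        for p in block.parameters():
            p.requires_grad = i >= n_frozen              # only deep blocks stay tunable

class ToyTextEncoder(nn.Module):
    """Stand-in for a real text tower, only to demonstrate the freezing logic."""
    def __init__(self, depth: int = 12, dim: int = 512):
        super().__init__()
        self.embeddings = nn.Embedding(49408, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

enc = ToyTextEncoder()
freeze_shallow_layers(enc, n_frozen=6)
print(sum(p.requires_grad for p in enc.parameters()))   # number of still-trainable tensors
```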
https://arxiv.org/abs/2305.14173
Integrating the brain structural and functional connectivity features is of great significance in both exploring brain science and analyzing cognitive impairment clinically. However, it remains a challenge to effectively fuse structural and functional features in exploring the brain network. In this paper, a novel brain structure-function fusing-representation learning (BSFL) model is proposed to effectively learn fused representation from diffusion tensor imaging (DTI) and resting-state functional magnetic resonance imaging (fMRI) for mild cognitive impairment (MCI) analysis. Specifically, the decomposition-fusion framework is developed to first decompose the feature space into the union of the uniform and the unique spaces for each modality, and then adaptively fuse the decomposed features to learn MCI-related representation. Moreover, a knowledge-aware transformer module is designed to automatically capture local and global connectivity features throughout the brain. Also, a uniform-unique contrastive loss is further devised to make the decomposition more effective and enhance the complementarity of structural and functional features. The extensive experiments demonstrate that the proposed model achieves better performance than other competitive methods in predicting and analyzing MCI. More importantly, the proposed model could be a potential tool for reconstructing unified brain networks and predicting abnormal connections during the degenerative processes in MCI.
https://arxiv.org/abs/2305.14404
Recent works show that the data distribution in a network's latent space is useful for estimating classification uncertainty and detecting out-of-distribution (OOD) samples. To obtain a well-regularized latent space that is conducive to uncertainty estimation, existing methods introduce significant changes to model architectures and training procedures. In this paper, we present a lightweight, fast, and high-performance regularization method for Mahalanobis distance-based uncertainty prediction that requires minimal changes to the network's architecture. To derive Gaussian latent representations favourable for Mahalanobis distance calculation, we introduce a self-supervised representation learning method that separates in-class representations into multiple Gaussians. Classes with non-Gaussian representations are automatically identified and dynamically clustered into multiple new classes that are approximately Gaussian. Evaluation on standard OOD benchmarks shows that our method achieves state-of-the-art results on OOD detection with minimal inference time, and is very competitive on predictive probability calibration. Finally, we show the applicability of our method to a real-life computer vision use case on microorganism classification.
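For context, here is a compact sketch of the standard Mahalanobis-distance OOD score that such methods build on (not the paper's full pipeline, which further splits non-Gaussian classes): fit one Gaussian per class with a shared covariance in the latent space, then score a sample by its minimum Mahalanobis distance to any class mean.

```python
# Sketch: class-conditional Gaussians with shared covariance + Mahalanobis OOD score.
import numpy as np

def fit_class_gaussians(feats: np.ndarray, labels: np.ndarray):
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centered = feats - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(feats)          # shared (tied) covariance
    return means, np.linalg.pinv(cov)

def mahalanobis_ood_score(x: np.ndarray, means: np.ndarray, prec: np.ndarray) -> float:
    d = x[None, :] - means                            # (n_classes, dim)
    return float(np.min(np.einsum("cd,de,ce->c", d, prec, d)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
labels = rng.integers(0, 10, size=500)
means, prec = fit_class_gaussians(feats, labels)
print(mahalanobis_ood_score(rng.normal(size=32), means, prec))
```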
https://arxiv.org/abs/2305.13849
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlighted severe limitations of these models in their ability to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. These are graph-structured semantic representations of images that contain objects, their attributes, and relations with other objects in a scene. In this work, we consider the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework along with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. Along with this, we propose novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding. Through extensive experiments, we demonstrate the effectiveness of our approach, which significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, improvements of up to $18\%$ for systematic generalization and $16.5\%$ for relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks.
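A toy illustration of negative mining in scene-graph space (a hypothetical, simplified graph format, not the paper's pipeline): swapping the attributes of two objects yields a hard negative caption that differs from the positive only in attribute binding.

```python
# Sketch: scene-graph attribute swap producing a hard negative caption.
def swap_attributes(graph: dict) -> dict:
    """graph = {'objects': [{'name':..., 'attr':...}, ...], 'relations': [(subj, pred, obj), ...]}"""
    neg = {"objects": [dict(o) for o in graph["objects"]], "relations": list(graph["relations"])}
    neg["objects"][0]["attr"], neg["objects"][1]["attr"] = (
        neg["objects"][1]["attr"], neg["objects"][0]["attr"])
    return neg

def to_caption(graph: dict) -> str:
    objs = {o["name"]: f'{o["attr"]} {o["name"]}' for o in graph["objects"]}
    return " and ".join(f"a {objs[s]} {p} a {objs[o]}" for s, p, o in graph["relations"])

g = {"objects": [{"name": "horse", "attr": "brown"}, {"name": "fence", "attr": "white"}],
     "relations": [("horse", "jumping over", "fence")]}
print(to_caption(g))                   # a brown horse jumping over a white fence
print(to_caption(swap_attributes(g)))  # a white horse jumping over a brown fence (negative)
```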
https://arxiv.org/abs/2305.13812
Vision-and-language (VL) pre-training aims to learn a general representation of image-text pairs that can be transferred to various vision-and-language tasks. Compared with modeling uni-modal data, the main challenge for VL models is how to learn cross-modal interaction from multimodal data, especially fine-grained interaction. Existing works have shown that fully transformer-based models that adopt attention mechanisms to learn in-layer cross-modal interaction can demonstrate impressive performance on various cross-modal downstream tasks. However, they ignore that the semantic information of different modalities at the same layer is not uniform, which causes the cross-modal interaction to collapse into a limited exchange of multi-modal semantic information. In this work, we propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal in-layer interaction and cross-layer interaction. UNIMO-3 can establish effective connections between different layers in a cross-modal encoder and adaptively capture the interaction between the two modalities at different levels. The experimental results show that our model achieves state-of-the-art performance on various downstream tasks, and an ablation study demonstrates that effective cross-layer learning improves the model's multimodal representation ability.
https://arxiv.org/abs/2305.13697
Text-based person search aims to retrieve the specified person's images given a textual description. The key to tackling such a challenging task is to learn powerful multi-modal representations. Towards this, we propose a Relation and Sensitivity aware representation learning method (RaSa), including two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA). For one thing, existing methods cluster representations of all positive pairs without distinction and overlook the noise problem caused by weak positive pairs, where the text and the paired image have noisy correspondences, thus leading to overfitting. RA offsets the overfitting risk by introducing a novel positive relation detection task (i.e., learning to distinguish strong and weak positive pairs). For another thing, learning invariant representations under data augmentation (i.e., being insensitive to some transformations) is a general practice for improving representation robustness in existing methods. Beyond that, we encourage the representation to perceive sensitive transformations via SA (i.e., learning to detect the replaced words), thus further promoting the representation's robustness. Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES and RSTPReid datasets, respectively. Code is available at: this https URL.
https://arxiv.org/abs/2305.13653
In cross-lingual named entity recognition (NER), self-training is commonly used to bridge the linguistic gap by training on pseudo-labeled target-language data. However, due to sub-optimal performance on target languages, the pseudo labels are often noisy and limit the overall performance. In this work, we aim to improve self-training for cross-lingual NER by combining representation learning and pseudo label refinement in one coherent framework. Our proposed method, namely ContProto mainly comprises two components: (1) contrastive self-training and (2) prototype-based pseudo-labeling. Our contrastive self-training facilitates span classification by separating clusters of different classes, and enhances cross-lingual transferability by producing closely-aligned representations between the source and target language. Meanwhile, prototype-based pseudo-labeling effectively improves the accuracy of pseudo labels during training. We evaluate ContProto on multiple transfer pairs, and experimental results show our method brings in substantial improvements over current state-of-the-art methods.
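A small sketch of prototype-based pseudo-label refinement in this spirit (illustrative only, not the ContProto code): maintain one prototype per entity class as the mean of its span representations, then relabel each target-language span with the class of its nearest prototype.

```python
# Sketch: class prototypes from source features, pseudo-label refinement for target spans.
import numpy as np

def build_prototypes(feats: np.ndarray, labels: np.ndarray, n_classes: int) -> np.ndarray:
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])

def refine_pseudo_labels(target_feats: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    # Cosine similarity of every target span representation to every class prototype.
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (t @ p.T).argmax(axis=1)

rng = np.random.default_rng(0)
src_feats = rng.normal(size=(300, 64))
src_labels = rng.integers(0, 4, size=300)
protos = build_prototypes(src_feats, src_labels, n_classes=4)
refined = refine_pseudo_labels(rng.normal(size=(50, 64)), protos)   # refined pseudo labels
```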
https://arxiv.org/abs/2305.13628
Semantic understanding of programs has attracted great attention in the community. Inspired by recent successes of large language models (LLMs) in natural language understanding, tremendous progress has been made by treating programming language as another sort of natural language and training LLMs on corpora of program code. However, programs are essentially different from texts, in the sense that they are normally heavily structured and syntax-strict. In particular, programs and their basic units (i.e., functions and subroutines) are designed to demonstrate a variety of behaviors and/or provide possible outputs given different inputs. The relationship between inputs and possible outputs/behaviors represents the functions/subroutines and profiles the program as a whole. Therefore, we propose to incorporate such a relationship into learning, to achieve a deeper semantic understanding of programs. To obtain inputs that are representative enough to trigger the execution of most of the code, we resort to fuzz testing and propose fuzz tuning to boost the performance of program understanding and code representation learning, given a pre-trained LLM. The effectiveness of the proposed method is verified on two program understanding tasks, namely code clone detection and code classification, and it outperforms the current state of the art by large margins. Code is available at this https URL.
https://arxiv.org/abs/2305.13592
In this article, we present our approach to single-modality vision representation learning. Understanding vision representations of product content is vital for recommendation, search, and advertising applications in e-commerce. We detail and contrast techniques used to fine-tune large-scale vision representation learning models efficiently under low-resource settings, covering several pretrained backbone architectures from both the convolutional neural network and vision transformer families. We outline the challenges of e-commerce applications at scale and highlight our efforts to train, evaluate, and serve visual representations more efficiently. We present ablation studies for several downstream tasks, including our visually similar ad recommendations. We evaluate the offline performance of the derived visual representations on downstream tasks. To this end, we present a novel text-to-image generative offline evaluation method for visually similar recommendation systems. Finally, we include online results from machine learning systems deployed in production at Etsy.
https://arxiv.org/abs/2305.13399
Recently, contrastive self-supervised learning, where the proximity of representations is determined based on the identities of samples, has made remarkable progress in unsupervised representation learning. SimSiam is a well-known example in this area, known for its simplicity yet powerful performance. However, it is known to be sensitive to changes in training configurations, such as hyperparameters and augmentation settings, due to its structural characteristics. To address this issue, we focus on the similarity between contrastive learning and the teacher-student framework in knowledge distillation. Inspired by the ensemble-based knowledge distillation approach, the proposed method, EnSiam, aims to improve the contrastive learning procedure using ensemble representations. This can provide stable pseudo labels, providing better performance. Experiments demonstrate that EnSiam outperforms previous state-of-the-art methods in most cases, including the experiments on ImageNet, which shows that EnSiam is capable of learning high-quality representations.
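A hedged sketch of the ensemble idea (not the EnSiam implementation): average the representations of several augmented views to form a more stable target, then apply the usual SimSiam negative-cosine loss with a stop-gradient on that ensemble target; the predictor head is omitted for brevity.

```python
# Sketch: ensemble target over augmented views + SimSiam-style negative-cosine loss.
import torch
import torch.nn.functional as F

def ensemble_target(encoder, views: list[torch.Tensor]) -> torch.Tensor:
    """Mean of the (normalized) representations of all augmented views."""
    with torch.no_grad():                                   # stop-gradient branch
        zs = [F.normalize(encoder(v), dim=1) for v in views]
    return F.normalize(torch.stack(zs).mean(dim=0), dim=1)

def simsiam_loss(predictions: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return -(F.normalize(predictions, dim=1) * target).sum(dim=1).mean()

# Toy encoder and batch of augmented views standing in for a real backbone and pipeline.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
views = [torch.randn(8, 3, 32, 32) for _ in range(4)]
target = ensemble_target(encoder, views)
loss = simsiam_loss(encoder(views[0]), target)
```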
https://arxiv.org/abs/2305.13391
Satellite image time series in the optical and infrared spectrum suffer from frequent data gaps due to cloud cover, cloud shadows, and temporary sensor outages. It has been a long-standing problem of remote sensing research how to best reconstruct the missing pixel values and obtain complete, cloud-free image sequences. We approach that problem from the perspective of representation learning and develop U-TILISE, an efficient neural model that is able to implicitly capture spatio-temporal patterns of the spectral intensities, and that can therefore be trained to map a cloud-masked input sequence to a cloud-free output sequence. The model consists of a convolutional spatial encoder that maps each individual frame of the input sequence to a latent encoding; an attention-based temporal encoder that captures dependencies between those per-frame encodings and lets them exchange information along the time dimension; and a convolutional spatial decoder that decodes the latent embeddings back into multi-spectral images. We experimentally evaluate the proposed model on EarthNet2021, a dataset of Sentinel-2 time series acquired all over Europe, and demonstrate its superior ability to reconstruct the missing pixels. Compared to a standard interpolation baseline, it increases the PSNR by 1.8 dB at previously seen locations and by 1.3 dB at unseen locations.
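The encoder / temporal-attention / decoder layout can be sketched schematically as below (layer types and sizes are illustrative stand-ins, not the U-TILISE architecture): each frame is encoded spatially, self-attention runs over the time axis at every spatial location, and a spatial decoder maps the latent sequence back to multi-spectral frames.

```python
# Schematic sketch of a per-frame spatial encoder, temporal attention, and spatial decoder.
import torch
import torch.nn as nn

class ToyUTILISE(nn.Module):
    def __init__(self, bands: int = 4, dim: int = 64):
        super().__init__()
        self.spatial_enc = nn.Conv2d(bands, dim, kernel_size=3, padding=1)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.spatial_dec = nn.Conv2d(dim, bands, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, bands, H, W) cloud-masked input sequence
        b, t, c, h, w = x.shape
        z = self.spatial_enc(x.reshape(b * t, c, h, w))          # per-frame encoding
        z = z.reshape(b, t, -1, h, w).permute(0, 3, 4, 1, 2)     # (b, h, w, t, dim)
        z = self.temporal(z.reshape(b * h * w, t, -1))           # attention over time
        z = z.reshape(b, h, w, t, -1).permute(0, 3, 4, 1, 2)     # back to (b, t, dim, h, w)
        return self.spatial_dec(z.reshape(b * t, -1, h, w)).reshape(b, t, c, h, w)

model = ToyUTILISE()
out = model(torch.randn(2, 6, 4, 16, 16))    # reconstructed, gap-free sequence
print(out.shape)                             # torch.Size([2, 6, 4, 16, 16])
```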
https://arxiv.org/abs/2305.13277