Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. However, a major bottleneck of current 3D recognition approaches is that they cannot recognize novel classes beyond the training categories, which limits their use in diverse real-world applications. In addition, current state-of-the-art 3D scene understanding approaches require high-quality labels to train neural networks and therefore perform well only in the fully supervised setting. This work presents a general and simple framework for 3D scene understanding when labeled scenes are quite limited. To acquire knowledge about novel categories, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that extracts and distills meaningful information from large-scale pre-trained vision-language models, benefiting open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel energy-based loss with boundary awareness that benefits from region-level boundary predictions. To encourage latent instance discrimination while guaranteeing efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate intermediate feature embeddings at multiple stages. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: this https URL.
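A minimal sketch of the general feature-distillation idea described above, assuming per-point target features have already been obtained by back-projecting 2D vision-language (CLIP-style) features onto the point cloud; the function names, tensor shapes, and cosine-distance loss are illustrative assumptions, not the paper's actual implementation:

    import torch.nn.functional as F

    def feature_distillation_loss(point_feats, clip_feats):
        """Align 3D backbone features with (frozen) vision-language features.

        point_feats: (N, C) features predicted by the 3D network for N points.
        clip_feats:  (N, C) target features back-projected from a pre-trained
                     vision-language model (kept fixed, hence detach()).
        """
        point_feats = F.normalize(point_feats, dim=-1)
        clip_feats = F.normalize(clip_feats.detach(), dim=-1)
        # Cosine-distance distillation: pull each point feature toward its target.
        return (1.0 - (point_feats * clip_feats).sum(dim=-1)).mean()

    def open_vocab_logits(point_feats, text_feats, temperature=0.07):
        """Open-vocabulary inference: score points against text embeddings of
        arbitrary class names produced by the text encoder."""
        point_feats = F.normalize(point_feats, dim=-1)   # (N, C)
        text_feats = F.normalize(text_feats, dim=-1)     # (K, C)
        return point_feats @ text_feats.t() / temperature  # (N, K) class scores

Because points and class-name texts share one embedding space after distillation, novel categories can be queried at inference time without retraining.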
https://arxiv.org/abs/2312.00663
Unsupervised relation extraction (URE) aims to extract relations between named entities from raw text without requiring manual annotations or pre-existing knowledge bases. In recent studies of URE, researchers put a notable emphasis on contrastive learning strategies for acquiring relation representations. However, these studies often overlook two important aspects: the inclusion of diverse positive pairs for contrastive learning and the exploration of appropriate loss functions. In this paper, we propose AugURE with both within-sentence pairs augmentation and augmentation through cross-sentence pairs extraction to increase the diversity of positive pairs and strengthen the discriminative power of contrastive learning. We also identify the limitation of noise-contrastive estimation (NCE) loss for relation representation learning and propose to apply margin loss for sentence pairs. Experiments on the NYT-FB and TACRED datasets demonstrate that the proposed relation representation learning and a simple K-Means clustering achieve state-of-the-art performance.
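A minimal sketch contrasting an NCE-style objective with a margin loss over sentence-pair similarities, the distinction the abstract highlights; embeddings, the margin value, and the temperature are placeholders rather than the paper's settings:

    import torch
    import torch.nn.functional as F

    def info_nce_loss(anchor, positive, negatives, temperature=0.05):
        """Standard NCE/InfoNCE-style loss for one anchor (shown for comparison).
        anchor, positive: (D,) embeddings; negatives: (K, D)."""
        pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature
        neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / temperature
        logits = torch.cat([pos.unsqueeze(0), neg])                 # (K+1,)
        return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

    def margin_loss(anchor, positive, negatives, margin=0.3):
        """Hinge-style margin loss: the positive pair should beat every negative
        pair by at least `margin` in cosine similarity."""
        pos = F.cosine_similarity(anchor, positive, dim=-1)
        neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1)
        return F.relu(margin - pos + neg).mean()

The margin formulation only asks positives to outrank negatives by a fixed gap, rather than normalizing over all candidates as NCE does.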
https://arxiv.org/abs/2312.00552
Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank, and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system, and demonstrate remarkable performance across a range of tasks; including the problem of left ventricular ejection fraction regression, and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can be directed towards clinical problems of interest yielding impressive, clinical grade diagnostic accuracy with a fraction of the training data typically required for such tasks.
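A minimal sketch of the kind of image-report contrastive pre-training described here, written as a symmetric InfoNCE loss over a batch of paired scans and reports; the tensor shapes, temperature, and one-report-per-scan assumption are illustrative:

    import torch
    import torch.nn.functional as F

    def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of (cine MRI clip, report) pairs.

        image_emb, text_emb: (B, D) embeddings from the two encoders; the i-th
        image and i-th report form the only positive pair in row/column i.
        """
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature          # (B, B)
        targets = torch.arange(image_emb.size(0))
        loss_i2t = F.cross_entropy(logits, targets)              # image -> report
        loss_t2i = F.cross_entropy(logits.t(), targets)          # report -> image
        return 0.5 * (loss_i2t + loss_t2i)

The shared embedding space learned this way is also what makes zero-shot diagnosis possible: conditions can be scored by comparing an image embedding against text embeddings of condition descriptions.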
https://arxiv.org/abs/2312.00357
After pre-training by generating the next word conditional on previous words, a Language Model (LM) acquires the ability of In-Context Learning (ICL), i.e., it can learn a new task conditional on the context of the given in-context examples (ICEs). Similarly, visually-conditioned Language Modelling is also used to train Vision-Language Models (VLMs) with ICL ability. However, such VLMs typically exhibit weaker classification abilities compared to contrastive learning-based models like CLIP, since the Language Modelling objective does not directly contrast whether an object is paired with a text. To improve the ICL of classification, using more ICEs to provide more knowledge is a straightforward way. However, this may largely increase the selection time, and more importantly, the inclusion of additional in-context images tends to extend the length of the in-context sequence beyond the processing capacity of a VLM. To alleviate these limitations, we propose to manipulate the label space of each ICE to increase its knowledge density, allowing fewer ICEs to convey as much information as a larger set would. Specifically, we propose two strategies, Label Distribution Enhancement and Visual Descriptions Enhancement, to improve in-context classification performance on diverse datasets, including the classic ImageNet and more fine-grained datasets like CUB-200. For example, using our approach on ImageNet, we increase accuracy from 74.70% in a 4-shot setting to 76.21% with just 2 shots, surpassing CLIP by 0.67%. On CUB-200, our method raises 1-shot accuracy from 48.86% to 69.05%, 12.15% higher than CLIP. The code is available at https://anonymous.4open.science/r/MLS_ICC.
https://arxiv.org/abs/2312.00351
Changing an attribute of a text without changing the content usually requires first disentangling the text into independent attribute and content representations. Then, in the inference phase, the representation of one attribute is tuned to a different value, with the expectation that the corresponding attribute of the text changes accordingly. The usual way of disentanglement is to add constraints on the latent space of an encoder-decoder architecture, including adversarial-based constraints and mutual-information-based constraints. However, previous semi-supervised processes of attribute change are usually not enough to guarantee both successful attribute change and content preservation. In this paper, we propose a novel approach that achieves robust control of attributes while enhancing content preservation. In this approach, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in the latent space. Unlike previous works, we re-disentangle the reconstructed sentence and compare the re-disentangled latent space with the original latent space, which forms a closed-loop disentanglement process and also helps content preservation. In addition, the contrastive learning method is able to replace the roles of mutual-information minimization and adversarial training in the disentanglement process, which alleviates the computation cost. We conducted experiments on three text datasets: the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset. The experimental results show the effectiveness of our model.
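A rough sketch of what a closed-loop check of this kind could look like: the reconstruction is re-encoded and its latents are compared with the originals, alongside a semi-supervised contrastive term on the attribute latent. The encoder/decoder interfaces, the greedy re-decoding, and the specific loss forms are assumptions for illustration, not the paper's implementation:

    import torch
    import torch.nn.functional as F

    def closed_loop_losses(encoder, decoder, tokens, attr_labels):
        """encoder(tokens) -> (attr_z, content_z); decoder(attr_z, content_z)
        -> vocabulary logits. attr_labels: (B,) attribute labels (may be partial)."""
        attr_z, content_z = encoder(tokens)                 # first disentanglement
        recon_logits = decoder(attr_z, content_z)           # reconstruction
        recon_tokens = recon_logits.argmax(dim=-1)          # greedy re-decoding
        attr_z2, content_z2 = encoder(recon_tokens)         # re-disentanglement

        # Closed-loop consistency: the re-disentangled latents should match the
        # original latent space.
        loop_loss = (F.mse_loss(attr_z2, attr_z.detach())
                     + F.mse_loss(content_z2, content_z.detach()))

        # Semi-supervised contrastive term on the attribute latent: pull together
        # examples sharing an attribute label, push apart the rest.
        z = F.normalize(attr_z, dim=-1)
        sim = z @ z.t() / 0.1
        same = attr_labels.unsqueeze(0) == attr_labels.unsqueeze(1)
        mask = ~torch.eye(len(attr_labels), dtype=torch.bool)
        pos = (sim * (same & mask)).sum(1) / (same & mask).sum(1).clamp(min=1)
        contrastive = (torch.logsumexp(sim.masked_fill(~mask, -1e9), dim=1) - pos).mean()
        return loop_loss, contrastive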
https://arxiv.org/abs/2312.00277
Graph contrastive learning has shown great promise when labeled data is scarce but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning that is based on the disagreement in likelihood due to different positive samples.
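A toy illustration of a disagreement-based uncertainty score of this flavor, here measured as the mean KL divergence of per-view likelihoods from their consensus; the exact quantity used in the paper may differ:

    def positive_sample_disagreement(probs_per_view):
        """Uncertainty from disagreement across positive samples (augmented views).

        probs_per_view: (V, N, C) tensor of probabilities from V views for N nodes.
        Returns one score per node; higher = more disagreement = less certain.
        """
        mean_p = probs_per_view.mean(dim=0, keepdim=True)                # (1, N, C)
        # Mean KL divergence of each view's distribution from the consensus.
        kl = (probs_per_view * (probs_per_view.clamp_min(1e-12).log()
                                - mean_p.clamp_min(1e-12).log())).sum(-1)  # (V, N)
        return kl.mean(dim=0)                                            # (N,)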
https://arxiv.org/abs/2312.00232
Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal content, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks like video scene/shot segmentation cater to short videos with clear visual shifts but falter for long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability, in both intra- and cross-domain settings.
https://arxiv.org/abs/2312.00220
With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. Inspired by the similarities and differences between DDS and contrastive learning for unpaired image-to-image translation (CUT), here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of contents, we introduce a straightforward approach to regulate structural consistency using the CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of the LDM, in particular those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining structural details and transforming content. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. Project page with code is available at this https URL.
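A minimal sketch of a PatchNCE-style (CUT) loss computed on intermediate token features, where spatially corresponding patches of the source and edited representations act as positives and all other sampled patches as negatives; the feature source (self-attention tokens), patch count, and temperature are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def patch_contrastive_loss(src_feats, tgt_feats, num_patches=256, tau=0.07):
        """src_feats, tgt_feats: (B, HW, C) token features at the same spatial
        resolution, e.g. taken from self-attention layers of the denoising U-Net."""
        B, HW, C = src_feats.shape
        num_patches = min(num_patches, HW)
        idx = torch.randperm(HW)[:num_patches]                 # sample patch locations
        q = F.normalize(tgt_feats[:, idx], dim=-1)             # (B, P, C) queries
        k = F.normalize(src_feats[:, idx].detach(), dim=-1)    # (B, P, C) keys
        logits = torch.bmm(q, k.transpose(1, 2)) / tau         # (B, P, P)
        targets = torch.arange(num_patches).expand(B, -1)      # positive = same location
        return F.cross_entropy(logits.reshape(-1, num_patches), targets.reshape(-1))

Tying positives to spatial location is what pushes the edited result to keep the source structure while the DDS term drives the content change.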
https://arxiv.org/abs/2311.18608
As text generative models can give increasingly long answers, we tackle the problem of synthesizing long text in digital ink. We show that the commonly used models for this task fail to generalize to long-form data and how this problem can be solved by augmenting the training data, changing the model architecture, and changing the inference procedure. These methods use a contrastive learning technique and are tailored specifically to the handwriting domain. They can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method reduces the character error rate on long-form English data by half compared to a baseline RNN and by 16% compared to the previous approach that aims at addressing the same problem. We show that all three parts of the method improve the recognizability of the generated ink. In addition, we evaluate the synthesized data in a human study and find that people perceive most of the generated data as real.
https://arxiv.org/abs/2311.17786
Contrastive learning has proven to be an effective method for pre-training models using weakly labeled data in the vision domain. Sentence transformers are the NLP counterpart to this architecture and have been growing in popularity due to their rich and effective sentence representations. Having effective sentence representations is paramount in multiple tasks, such as information retrieval, retrieval augmented generation (RAG), and sentence comparison. Keeping in mind the deployability factor of transformers, evaluating the robustness of sentence transformers is of utmost importance. This work focuses on evaluating the robustness of sentence encoders. We employ several adversarial attacks to evaluate their robustness. This system uses character-level attacks in the form of random character substitution, word-level attacks in the form of synonym replacement, and sentence-level attacks in the form of intra-sentence word order shuffling. The results of the experiments strongly undermine the robustness of sentence encoders: the models produce significantly different predictions as well as embeddings on perturbed datasets, and their accuracy can fall by up to 15 percent on perturbed datasets compared to unperturbed ones. Furthermore, the experiments demonstrate that these embeddings do capture the semantic and syntactic structure (sentence order) of sentences. However, existing supervised classification strategies fail to leverage this information and merely function as n-gram detectors.
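Simple reference implementations of the three perturbation types named above (character substitution, synonym replacement from a user-supplied map, and intra-sentence word shuffling); the rates and details are illustrative, not the paper's exact settings:

    import random

    def char_substitution(sentence, rate=0.1):
        """Character-level attack: randomly replace alphabetic characters."""
        chars = list(sentence)
        for i in range(len(chars)):
            if chars[i].isalpha() and random.random() < rate:
                chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
        return "".join(chars)

    def synonym_replacement(sentence, synonyms, rate=0.2):
        """Word-level attack: swap words for synonyms from a supplied mapping."""
        words = [synonyms.get(w, w) if random.random() < rate else w
                 for w in sentence.split()]
        return " ".join(words)

    def word_shuffle(sentence):
        """Sentence-level attack: shuffle intra-sentence word order."""
        words = sentence.split()
        random.shuffle(words)
        return " ".join(words)

Running a sentence encoder on the original and perturbed versions of a labeled test set and comparing predictions/embeddings reproduces the kind of robustness gap the paper reports.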
https://arxiv.org/abs/2311.17722
Face recognition technology is widely used in the financial field, where various types of liveness attack behaviors need to be addressed. Existing liveness detection algorithms are trained on specific training datasets and tested on testing datasets, but their performance and robustness when transferring to unseen datasets are relatively poor. To tackle this issue, we propose a face liveness detection method based on image-text pairs and contrastive learning, dividing liveness attack problems in the financial field into eight categories and using text information to describe the images of these eight types of attacks. A text encoder and an image encoder are used to extract feature vector representations for the classification description text and face images, respectively. By maximizing the similarity of positive samples and minimizing the similarity of negative samples, the model learns shared representations between images and texts. The proposed method is capable of effectively detecting specific liveness attack behaviors in certain scenarios, such as those occurring in dark environments or involving the tampering of ID card photos. It is also effective in detecting traditional liveness attack methods, such as printed-photo attacks and screen-remake attacks. The zero-shot face liveness detection capability on five public datasets, including NUAA, CASIA-FASD, Replay-Attack, OULU-NPU and MSU-MFSD, also reaches the level of commercial algorithms. The detection capability of the proposed algorithm was verified on five types of testing datasets, and the results show that the method outperformed commercial algorithms, with detection rates reaching 100% on multiple datasets, demonstrating the effectiveness and robustness of introducing image-text pairs and contrastive learning into liveness detection tasks as proposed in this paper.
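A minimal sketch of how such category descriptions could be used for zero-shot scoring at inference time; the prompt wordings, encoder interfaces, and category list below are made up for illustration and are not the paper's actual eight categories:

    import torch.nn.functional as F

    ATTACK_PROMPTS = [          # illustrative descriptions, not the paper's list
        "a genuine live face",
        "a printed photo of a face",
        "a face replayed on a screen",
        "a tampered ID card photo",
        "a face captured in a dark environment",
    ]

    def zero_shot_liveness(image_emb, text_encoder, tokenizer):
        """Score a face image against text descriptions of attack categories.

        image_emb: (D,) embedding from the image encoder; text_encoder/tokenizer
        are whatever CLIP-style text tower is being used (signatures assumed).
        """
        text_emb = text_encoder(tokenizer(ATTACK_PROMPTS))        # (K, D)
        sims = F.cosine_similarity(image_emb.unsqueeze(0), text_emb, dim=-1)
        return ATTACK_PROMPTS[int(sims.argmax())], sims.softmax(dim=-1)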
https://arxiv.org/abs/2311.17583
We study the task of extending a large language model (LLM) into a vision-language instruction-following model. This task is crucial but challenging, since the LLM is trained on the text modality only, making it hard to effectively digest the visual modality. To address this, existing methods typically train a visual adapter to align the representations of a pre-trained vision transformer (ViT) and the LLM via a generative image captioning loss. However, we find that the generative objective can only produce weak alignment between vision and language, making the aligned vision-language model heavily dependent on instruction fine-tuning data. In this paper, we propose CG-VLM, which applies both Contrastive and Generative alignment objectives to effectively align the representations of the ViT and the LLM. Different from the image-level and sentence-level alignment in common contrastive learning settings, CG-VLM aligns image-patch-level features and text-token-level embeddings, which is hard to achieve because no explicit patch-token grounding is provided in standard image captioning datasets. To address this issue, we propose to maximize the averaged similarity between pooled image-patch features and text-token embeddings. Extensive experiments demonstrate that the proposed CG-VLM produces strong vision-language alignment and is an efficient instruction learner. For example, using only 10% of the instruction tuning data, we reach 95% of the performance of the state-of-the-art method LLaVA [29] on the zero-shot ScienceQA-Image benchmark.
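A minimal sketch of one way the "averaged similarity between pooled image-patch features and text-token embeddings" could be turned into a batch-level contrastive objective; the pooling choice, temperature, and tensor shapes are assumptions rather than CG-VLM's exact formulation:

    import torch
    import torch.nn.functional as F

    def patch_token_alignment(patch_feats, token_embs, temperature=0.07):
        """patch_feats: (B, P, D) ViT patch features; token_embs: (B, T, D) text
        token embeddings projected to the same dimension. No patch-token
        grounding is needed: similarities are averaged over a caption's tokens."""
        pooled = F.normalize(patch_feats.mean(dim=1), dim=-1)           # (B, D)
        tokens = F.normalize(token_embs, dim=-1)                        # (B, T, D)
        # Averaged similarity between each image's pooled patches and every
        # caption's tokens -> (B_images, B_texts) score matrix.
        scores = torch.einsum("id,jtd->ijt", pooled, tokens).mean(-1) / temperature
        targets = torch.arange(pooled.size(0))
        return 0.5 * (F.cross_entropy(scores, targets)
                      + F.cross_entropy(scores.t(), targets))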
https://arxiv.org/abs/2311.17945
Like masked language modeling (MLM) in natural language processing, masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN). Contrasted with other training paradigms like supervised learning and unsupervised contrastive learning, masked image modeling (MIM) pretraining typically demands significant computational resources in order to manage large training data batches (e.g., 4096). The significant memory and computation requirements pose a considerable challenge to its broad adoption. To mitigate this, we introduce a novel learning framework, termed Block-Wise Masked Image Modeling (BIM). This framework involves decomposing the MIM tasks into several sub-tasks with independent computation patterns, resulting in block-wise back-propagation operations instead of the traditional end-to-end approach. Our proposed BIM maintains superior performance compared to conventional MIM while greatly reducing peak memory consumption. Moreover, BIM naturally enables the concurrent training of numerous DNN backbones of varying depths. This leads to the creation of multiple trained DNN backbones, each tailored to different hardware platforms with distinct computing capabilities. This approach significantly reduces computational costs in comparison with training each DNN backbone individually. Our framework offers a promising solution for resource constrained training of MIM.
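A toy illustration of block-wise back-propagation with gradients cut between blocks, which is the general mechanism behind the peak-memory saving; the block/head definitions, losses, and dimensions are invented for illustration and are far simpler than an actual MIM pipeline:

    import torch
    import torch.nn as nn

    class BlockWiseModel(nn.Module):
        """Each block gets its own local head/loss and gradients are stopped
        between blocks, so no backward pass spans the whole network at once."""
        def __init__(self, dim=256, num_blocks=3):
            super().__init__()
            self.blocks = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)])
            self.local_heads = nn.ModuleList(
                [nn.Linear(dim, dim) for _ in range(num_blocks)])  # per-block decoders

        def forward(self, x, target):
            losses = []
            for block, head in zip(self.blocks, self.local_heads):
                x = block(x.detach())            # stop-gradient: no end-to-end backprop
                losses.append(((head(x) - target) ** 2).mean())  # local reconstruction
            return losses

    # Each local loss is back-propagated on its own, so peak memory is bounded
    # by a single block's graph rather than the full network's.
    model = BlockWiseModel()
    for loss in model(torch.randn(4, 256), torch.randn(4, 256)):
        loss.backward()

A side effect of this structure is that the prefix of blocks up to any depth is itself a trained backbone, which is how multiple backbones of varying depths come out of one run.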
https://arxiv.org/abs/2311.17218
Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic nature of speech-driven 3D face animation and employ diffusion models for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address these limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. We then distill a trained diffusion model with hundreds of steps into a lightweight model with 8 steps for acceleration. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.
https://arxiv.org/abs/2311.16565
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with a transformer-based architecture. However, we observe that the emphasis of MR and HD differs, with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently, the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialties of the two tasks. To tackle the issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra- and inter-modality across multiple granularities, UVCOM achieves comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate local relation modeling and global knowledge accumulation via a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms the state-of-the-art methods by a remarkable margin.
https://arxiv.org/abs/2311.16464
Contrastive vision-language models, e.g., CLIP, have garnered substantial attention for their exceptional generalization capabilities. However, their robustness to perturbations has ignited concerns. Existing strategies typically reinforce their resilience against adversarial examples by enabling the image encoder to "see" these perturbed examples, often necessitating a complete retraining of the image encoder on both natural and adversarial samples. In this study, we propose a new method to enhance robustness solely through text augmentation, eliminating the need for retraining the image encoder on adversarial examples. Our motivation arises from the realization that text and image data inherently occupy a shared latent space, comprising latent content variables and style variables. This insight suggests the feasibility of learning to disentangle these latent content variables using text data exclusively. To accomplish this, we introduce an effective text augmentation method that focuses on modifying the style while preserving the content in the text data. By changing the style part of the text data, we empower the text encoder to emphasize latent content variables, ultimately enhancing the robustness of vision-language models. Our experiments across various datasets demonstrate substantial improvements in the robustness of the pre-trained CLIP model.
https://arxiv.org/abs/2311.16445
In this study, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. Drawing on the wealth of the UMLS knowledge graph and harnessing cutting-edge Large Language Models, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of three steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. Through rigorous evaluations via the extensive BioLORD testing suite and diverse downstream tasks, we demonstrate consistent and substantial performance improvements over the previous state of the art (e.g. +2pts on MedSTS, +2.5pts on MedNLI-S, +6.1pts on EHR-Rel-B). Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 becoming a precious tool for future biomedical applications.
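A minimal sketch of the weight-averaging step mentioned above, i.e., element-wise averaging of several checkpoints of the same architecture; which checkpoints BioLORD-2023 averages and with what weights is not specified here, so treat this purely as the generic operation:

    import torch

    def average_checkpoints(state_dicts):
        """Element-wise mean of several state_dicts sharing one architecture."""
        avg = {}
        for key in state_dicts[0]:
            avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        return avg

    # Usage sketch: model.load_state_dict(average_checkpoints([sd_a, sd_b, sd_c]))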
https://arxiv.org/abs/2311.16075
We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset. We will make the code publicly available.
https://arxiv.org/abs/2311.15977
As 3D perception problems grow in popularity and the need for large-scale labeled datasets for LiDAR semantic segmentation increases, new methods arise that aim to reduce the necessity for dense annotations by employing weakly-supervised training. However, these methods continue to show weak boundary estimation and high false negative rates for small objects and distant sparse regions. We argue that such weaknesses can be compensated for by using RGB images, which provide a denser representation of the scene. We propose an image-guidance network (IGNet) which builds upon the idea of distilling high-level feature information from a domain-adapted, synthetically trained 2D semantic segmentation network. We further utilize a one-way contrastive learning scheme alongside a novel mixing strategy called FOVMix, to combat the horizontal field-of-view mismatch between the two sensors and enhance the effects of image guidance. IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points, while introducing no additional annotation burden or computational/memory cost during inference. Furthermore, we show that our contributions also prove effective for semi-supervised training, where IGNet claims state-of-the-art results on both ScribbleKITTI and SemanticKITTI.
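A minimal sketch of what a one-way contrastive distillation objective between 2D image features and 3D point features can look like, with gradients blocked on the 2D side; the point-to-pixel pairing, temperature, and loss form are assumptions for illustration, not IGNet's exact formulation:

    import torch
    import torch.nn.functional as F

    def one_way_contrastive(point_feats, image_feats, temperature=0.1):
        """One-way contrastive distillation: 3D point features are pulled toward
        the corresponding (frozen) 2D image features, while gradients never flow
        back into the 2D branch.

        point_feats: (N, D) features of LiDAR points that project into the image.
        image_feats: (N, D) 2D features sampled at the projected pixel locations.
        """
        p = F.normalize(point_feats, dim=-1)
        q = F.normalize(image_feats.detach(), dim=-1)   # one-way: 2D side is fixed
        logits = p @ q.t() / temperature                # (N, N)
        targets = torch.arange(p.size(0))
        return F.cross_entropy(logits, targets)         # only the paired pixel is positive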
https://arxiv.org/abs/2311.15605
Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied comprehensive manually designed architectures with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision to instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
https://arxiv.org/abs/2311.15080