Diffusion models have gained increasing popularity for their generative capabilities. Recently, there has been a surging need to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior": real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties on the relation prompt: 1) the relation prompt should capture the interaction between objects, enforced by the preposition prior; 2) the relation prompt should be disentangled from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
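To make the preposition prior concrete, below is a minimal, hypothetical sketch of a relation-steering contrastive term: a learnable relation token embedding is pulled toward a basis of preposition embeddings and pushed away from other words. The function names, dimensions, and temperature are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a relation-steering contrastive term, assuming the relation prompt is a
# single learnable token embedding and we can read the frozen text encoder's
# token-embedding table. All names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def steering_loss(relation_emb, preposition_embs, negative_embs, temperature=0.07):
    """InfoNCE-style loss that pulls the relation token toward a basis of
    preposition embeddings (positives) and away from other words (negatives)."""
    r = F.normalize(relation_emb, dim=-1)                  # (d,)
    pos = F.normalize(preposition_embs, dim=-1)            # (P, d)
    neg = F.normalize(negative_embs, dim=-1)               # (N, d)
    pos_logits = pos @ r / temperature                     # (P,)
    neg_logits = neg @ r / temperature                     # (N,)
    # treat every preposition as a positive against the shared negative set
    denom = torch.logsumexp(torch.cat([pos_logits, neg_logits]), dim=0)
    return (denom - pos_logits).mean()

# usage with random stand-ins for the token-embedding table
d = 768
relation_emb = torch.randn(d, requires_grad=True)          # the <R> token, optimised
preposition_embs = torch.randn(16, d)                      # e.g. "on", "inside", "beneath", ...
negative_embs = torch.randn(256, d)                        # other vocabulary words
loss = steering_loss(relation_emb, preposition_embs, negative_embs)
loss.backward()
```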
https://arxiv.org/abs/2303.13495
Images taken under low-light conditions tend to suffer from poor visibility, which can decrease image quality and even reduce the performance of downstream tasks. It is hard for a CNN-based method to learn generalized features that can recover normal images from ones captured under various unknown low-light conditions. In this paper, we propose to incorporate contrastive learning into an illumination correction network to learn abstract representations that distinguish various low-light conditions in the representation space, with the purpose of enhancing the generalizability of the network. Considering that light conditions can change the frequency components of the images, the representations are learned and compared in both the spatial and frequency domains to take full advantage of contrastive learning. The proposed method is evaluated on the LOL and LOL-V2 datasets; the results show that it achieves better qualitative and quantitative results than other state-of-the-art methods.
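As a rough illustration of contrasting representations in both domains, the sketch below pairs a spatial branch with a frequency branch built on the amplitude spectrum from `torch.fft.fft2`; the encoder, projection sizes, and equal loss weighting are assumptions, not the paper's architecture.

```python
# Sketch of contrastive learning in spatial and frequency domains with a toy encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature           # (B, B), positives on the diagonal
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

class DualDomainEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(32, dim))
        self.freq = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, dim))

    def forward(self, x):
        amp = torch.abs(torch.fft.fft2(x, norm="ortho"))   # amplitude spectrum
        return self.spatial(x), self.freq(amp)

enc = DualDomainEncoder()
view1 = torch.rand(8, 3, 64, 64)                           # one low-light variant
view2 = torch.rand(8, 3, 64, 64)                           # another variant of the same scenes
(s1, f1), (s2, f2) = enc(view1), enc(view2)
loss = info_nce(s1, s2) + info_nce(f1, f2)                 # contrast in both domains
```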
https://arxiv.org/abs/2303.13412
Multiple instance learning (MIL) models have been extensively used in pathology to predict biomarkers and risk-stratify patients from gigapixel-sized images. Machine learning problems in medical imaging often deal with rare diseases, making it important for these models to work in a label-imbalanced setting. Furthermore, these imbalances can occur in out-of-distribution (OOD) datasets when the models are deployed in the real world. We leverage the idea that decoupling feature and classifier learning can lead to improved decision boundaries for label-imbalanced datasets. To this end, we investigate the integration of supervised contrastive learning with multiple instance learning (SC-MIL). Specifically, we propose a joint-training MIL framework for label-imbalanced settings that progressively transitions from learning bag-level representations to optimal classifier learning. We perform experiments with different imbalance settings for two well-studied problems in cancer pathology: subtyping of non-small cell lung cancer and subtyping of renal cell carcinoma. SC-MIL provides large and consistent improvements over other techniques on both in-distribution (ID) and OOD held-out sets across multiple imbalance settings.
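The core ingredient, a supervised contrastive loss over bag-level embeddings in which bags sharing a subtype label act as positives, can be sketched as follows; the attention pooling, the progressive training schedule, and the paper's hyperparameters are omitted, and the shapes are illustrative.

```python
# Sketch of a supervised contrastive loss on pooled MIL bag embeddings.
import torch
import torch.nn.functional as F

def supcon_loss(bag_embs, bag_labels, temperature=0.1):
    """Bags sharing a label are positives; all other bags are negatives."""
    z = F.normalize(bag_embs, dim=-1)                        # (B, d)
    sim = z @ z.t() / temperature                            # (B, B)
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, -1e9)                   # exclude self-similarity
    pos_mask = (bag_labels[:, None] == bag_labels[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)                # avoid division by zero
    return -(log_prob * pos_mask.float()).sum(1).div(pos_counts).mean()

bag_embs = torch.randn(16, 256)                              # pooled whole-slide bag features
bag_labels = torch.randint(0, 2, (16,))                      # e.g. two cancer subtypes
loss = supcon_loss(bag_embs, bag_labels)
```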
https://arxiv.org/abs/2303.13405
This paper presents a new adversarial training framework for image inpainting with segmentation confusion adversarial training (SCAT) and contrastive learning. SCAT plays an adversarial game between an inpainting generator and a segmentation network, which provides pixel-level local training signals and can adapt to images with free-form holes. By combining SCAT with standard global adversarial training, the new adversarial training framework exhibits the following three advantages simultaneously: (1) the global consistency of the repaired image, (2) the local fine texture details of the repaired image, and (3) the flexibility of handling images with free-form holes. Moreover, we propose textural and semantic contrastive learning losses to stabilize and improve our inpainting model's training by exploiting the feature representation space of the discriminator, in which the inpainted images are pulled closer to the ground truth images but pushed farther from the corrupted images. The proposed contrastive losses better guide the repaired images to move from the corrupted image data points to the real image data points in the feature representation space, resulting in more realistic completed images. We conduct extensive experiments on two benchmark datasets, demonstrating our model's effectiveness and superiority both qualitatively and quantitatively.
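A minimal sketch of the segmentation-confusion game is given below, assuming a toy segmentation network that predicts, per pixel, whether a pixel was inpainted; the actual generator, network architectures, and loss weights are stand-ins rather than the paper's implementation.

```python
# Sketch of segmentation confusion adversarial training (SCAT) with stand-in networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

seg_net = nn.Sequential(nn.Conv2d(3, 32, 3, 1, 1), nn.ReLU(),
                        nn.Conv2d(32, 1, 3, 1, 1))           # per-pixel logits: 1 = inpainted

def seg_net_loss(completed, real, hole_mask):
    """The segmentation net should recover the hole mask on completed images
    and predict 'real' everywhere on ground-truth images."""
    loss_fake = F.binary_cross_entropy_with_logits(seg_net(completed.detach()), hole_mask)
    loss_real = F.binary_cross_entropy_with_logits(seg_net(real), torch.zeros_like(hole_mask))
    return loss_fake + loss_real

def generator_adv_loss(completed, hole_mask):
    """The generator tries to make inpainted pixels indistinguishable from real ones."""
    pred = seg_net(completed)
    return F.binary_cross_entropy_with_logits(pred, torch.zeros_like(hole_mask))

real = torch.rand(4, 3, 64, 64)
hole_mask = (torch.rand(4, 1, 64, 64) > 0.7).float()          # free-form holes
completed = torch.rand(4, 3, 64, 64)                          # stand-in for generator output
d_loss = seg_net_loss(completed, real, hole_mask)
g_loss = generator_adv_loss(completed, hole_mask)
```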
https://arxiv.org/abs/2303.13133
Increasing scene-awareness is a key challenge in video anomaly detection (VAD). In this work, we propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos. We first incorporate foreground object and background scene features with high-level semantics by taking advantage of pre-trained video parsing models. Then, building upon the autoencoder-based reconstruction framework, we introduce both scene-level and object-level contrastive learning to enforce the encoded latent features to be compact within the same semantic classes while being separable across different classes. This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability. Moreover, to tackle rare normal activities, we design a skeleton-based motion augmentation to increase samples and refine the model further. Extensive experiments on three public datasets and scene-dependent mixture datasets validate the effectiveness of our proposed method.
https://arxiv.org/abs/2303.13051
Existing defense methods against adversarial attacks can be categorized into training-time and test-time defenses. Training-time defense, i.e., adversarial training, requires a significant amount of extra training time and often cannot generalize to unseen attacks. On the other hand, test-time defense by test-time weight adaptation requires access to perform gradient descent on (part of) the model weights, which can be infeasible for models with frozen weights. To address these challenges, we propose DRAM, a novel defense method to Detect and Reconstruct multiple types of Adversarial attacks via a Masked autoencoder (MAE). We demonstrate how to use MAE losses to build a KS-test to detect adversarial attacks. Moreover, the MAE losses can be used to repair adversarial samples from unseen attack types. In this sense, DRAM neither requires model weight updates at test time nor augments the training set with more adversarial samples. Evaluated on large-scale ImageNet data, DRAM achieves the best average detection rate of 82% across eight types of adversarial attacks, compared with other detection baselines. For reconstruction, DRAM improves the robust accuracy by 6% ~ 41% for Standard ResNet50 and 3% ~ 8% for Robust ResNet50 compared with other self-supervision tasks, such as rotation prediction and contrastive learning.
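The detection step can be sketched very compactly: collect per-image MAE reconstruction losses on known-clean data, compare them with the losses of an incoming batch via a two-sample KS test, and flag a distribution shift. The `mae_loss` function below is a hypothetical placeholder for running the MAE, and the threshold is illustrative.

```python
# Sketch of KS-test based detection from MAE reconstruction losses.
import numpy as np
from scipy.stats import ks_2samp

def mae_loss(batch):
    # placeholder: in practice, run the masked autoencoder and return the
    # per-image reconstruction error for each sample in the batch
    return np.random.rand(len(batch))

clean_reference = mae_loss(range(512))        # losses collected on known-clean data
incoming = mae_loss(range(64))                # losses on the batch under test

stat, p_value = ks_2samp(clean_reference, incoming)
is_adversarial = p_value < 0.05               # a distribution shift flags an attack
print(f"KS statistic={stat:.3f}, p={p_value:.3f}, flagged={is_adversarial}")
```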
https://arxiv.org/abs/2303.12848
This work focuses on sign language retrieval, a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos not only contain visual signals but also carry abundant semantic meanings by themselves, since sign languages are also natural languages. Considering this characteristic, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed cross-lingual contrastive learning. Another challenge arises from data scarcity: sign language datasets are orders of magnitude smaller in scale than those for speech recognition. We alleviate this issue by adapting a domain-agnostic sign encoder pre-trained on large-scale sign videos to the target domain via pseudo-labeling. Our framework, termed domain-aware sign language retrieval via Cross-lingual Contrastive learning (CiCo for short), outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on the PHOENIX-2014T dataset. Code and models are available at: this https URL.
https://arxiv.org/abs/2303.12793
The goal of video segmentation is to accurately segment and track every pixel in diverse scenarios. In this paper, we present Tube-Link, a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture. Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks. To enhance the modeling of cross-tube relationships, we propose an effective way to perform tube-level linking via attention along the queries. In addition, we introduce temporal contrastive learning to learn instance-wise discriminative features for tube-level association. Our approach offers flexibility and efficiency for both short and long video inputs, as the length of each subclip can be varied according to the needs of datasets or scenarios. Tube-Link outperforms existing specialized architectures by a significant margin on five video segmentation datasets. Specifically, it achieves almost 13% relative improvement on VIPSeg and 4% improvement on KITTI-STEP over the strong baseline Video K-Net. When using a ResNet50 backbone on Youtube-VIS-2019 and 2021, Tube-Link boosts IDOL by 3% and 4%, respectively. Code will be available.
https://arxiv.org/abs/2303.12782
Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called Masked Contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft labels with the aid of coarse labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample's augmentations are considered hard positives, and supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain as special cases many existing state-of-the-art works and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art on various datasets, including the CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products, and Stanford Cars196 datasets. Code and annotations are available at this https URL.
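A hypothetical sketch of the soft-label construction is shown below: inter-sample similarities are masked by the coarse labels and normalized into a soft target distribution, which then supervises the contrastive logits of an augmented view. Shapes, temperatures, and the key set are illustrative, not the paper's exact formulation.

```python
# Sketch of MaskCon-style soft targets from coarse-label-masked similarities.
import torch
import torch.nn.functional as F

def maskcon_targets(query, keys, coarse_q, coarse_k, temperature=0.1):
    """Soft labels over the key set: zero outside the query's coarse class,
    similarity-weighted inside it."""
    sim = F.normalize(query, dim=-1) @ F.normalize(keys, dim=-1).t()    # (B, K)
    mask = (coarse_q[:, None] == coarse_k[None, :]).float()             # coarse-label mask
    masked_logits = sim / temperature + (mask - 1.0) * 1e9              # suppress other classes
    return F.softmax(masked_logits, dim=1)

def soft_contrastive_loss(aug, keys, targets, temperature=0.1):
    """Cross-entropy between soft targets and the augmented view's logits over the keys."""
    logits = F.normalize(aug, dim=-1) @ F.normalize(keys, dim=-1).t() / temperature
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

B, K, d = 8, 128, 256
query, aug = torch.randn(B, d), torch.randn(B, d)      # two views of the same samples
keys = torch.randn(K, d)                               # memory-bank / batch keys
coarse_q = torch.randint(0, 4, (B,))                   # coarse labels of the queries
coarse_k = torch.randint(0, 4, (K,))                   # coarse labels of the keys
targets = maskcon_targets(query, keys, coarse_q, coarse_k)
loss = soft_contrastive_loss(aug, keys, targets)
```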
https://arxiv.org/abs/2303.12756
Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in high-dimensional multi-view data. Contrastive learning (CL), a popular self-supervised learning method, has recently attracted considerable attention. In this study, we propose a novel multi-view feature extraction method based on triple contrastive heads (MFETCH), which combines the sample-, recovery-, and feature-level contrastive losses to extract the sufficient yet minimal subspace discriminative information in compliance with the information bottleneck principle. In MFETCH, we construct the feature-level contrastive loss, which removes the redundant information in the consistency information to achieve the minimality of the subspace discriminative information. Moreover, the recovery-level contrastive loss is also constructed in MFETCH, which captures the view-specific discriminative information to achieve the sufficiency of the subspace discriminative information. The numerical experiments demonstrate that the proposed method offers a strong advantage for multi-view feature extraction.
https://arxiv.org/abs/2303.12615
Renal transplantation emerges as the most effective solution for end-stage renal disease. Arising from complex causes, a substantial risk of chronic transplant dysfunction persists and may lead to graft loss. Medical imaging plays a substantial role in renal transplant monitoring in clinical practice. However, graft supervision is multi-disciplinary, notably joining nephrology, urology, and radiology, and identifying robust biomarkers from such high-dimensional and complex data for prognosis is challenging. In this work, taking inspiration from the recent success of Large Language Models (LLMs), we propose MEDIMP -- Medical Images and Prompts -- a model to learn meaningful multi-modal representations of renal transplant Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE MRI) by incorporating structural clinicobiological data after translating them into text prompts. MEDIMP is based on contrastive learning from joint text-image paired embeddings to perform this challenging task. Moreover, we propose a framework that generates medical prompts using automatic textual data augmentations from LLMs. Our goal is to learn meaningful manifolds of renal transplant DCE MRI that are relevant to the prognosis of the transplant or patient status (2, 3, and 4 years after the transplant), fully exploiting the available multi-modal data in the most efficient way. Extensive experiments and comparisons with other renal transplant representation learning methods with limited data prove the effectiveness of MEDIMP in a relevant clinical setting, giving new directions toward medical prompts. Our code is available at this https URL.
https://arxiv.org/abs/2303.12445
Event-based cameras offer reliable measurements for performing computer vision tasks in high-dynamic-range environments and during fast motion maneuvers. However, adopting deep learning in event-based vision faces the challenge of annotated data scarcity due to the recency of event cameras. Transferring the knowledge that can be obtained from conventional-camera annotated data offers a practical solution to this challenge. We develop an unsupervised domain adaptation algorithm for training a deep network for event-based image classification using contrastive learning and uncorrelated conditioning of data. Our solution outperforms the existing algorithms for this purpose.
https://arxiv.org/abs/2303.12424
Incomplete multi-view clustering (IMVC) is an unsupervised approach, among which IMVC via contrastive learning has received attention due to its excellent performance. Previous methods have the following problems: 1) over-reliance on additional projection heads when solving the dimensional collapse problem, in which latent features are only valid in lower-dimensional subspaces during clustering, even though many parameters in the projection heads are unnecessary; 2) the recovered views contain inconsistent private information, and useless private information can mislead the learning of common semantics when consistency learning and reconstruction learning are performed on the same feature. To address the above issues, we propose a novel incomplete multi-view contrastive clustering framework. This framework directly optimizes the latent feature subspace and utilizes the learned feature vectors and their sub-vectors for reconstruction learning and consistency learning, thereby effectively avoiding dimensional collapse without relying on projection heads. Since the reconstruction loss and contrastive loss are computed on different features, the adverse effect of useless private information is reduced. For incomplete data, the missing information is recovered by a cross-view prediction mechanism, and the inconsistent information from different views is discarded via minimum conditional entropy to further avoid the influence of private information. Extensive experiments on five public datasets show that the method achieves state-of-the-art clustering results.
https://arxiv.org/abs/2303.12241
We present a new method for self-supervised learning and knowledge distillation based on multi-views and multi-representations (MV-MR). MV-MR is based on maximizing the dependence between learnable embeddings from the augmented and non-augmented views, jointly with maximizing the dependence between learnable embeddings from the augmented view and multiple non-learnable representations from the non-augmented view. We show that the proposed method can be used for efficient self-supervised classification and model-agnostic knowledge distillation. Unlike other self-supervised techniques, our approach does not use any contrastive learning, clustering, or stop gradients. MV-MR is a generic framework that allows the incorporation of constraints on the learnable embeddings by using image multi-representations as regularizers. Along this line, knowledge distillation is considered a particular case of such regularization. MV-MR provides state-of-the-art performance on the STL10 and ImageNet-1K datasets among non-contrastive and clustering-free methods. We show that a lower-complexity ResNet50 model pretrained using the proposed knowledge distillation based on the CLIP ViT model achieves state-of-the-art performance on STL10 linear evaluation. The code is available at: this https URL
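As a rough illustration of dependence maximization against a non-learnable representation, the sketch below uses a simple cross-correlation criterion as a stand-in for the paper's dependence measure; the hand-crafted descriptor and all dimensions are hypothetical.

```python
# Sketch of maximising dependence between a learnable embedding and a fixed
# (non-learnable) image representation via a cross-correlation criterion.
import torch

def cross_correlation_dependence(z, r, eps=1e-6):
    """Higher when the learnable embedding z and the fixed representation r
    are strongly (linearly) dependent across the batch."""
    z = (z - z.mean(0)) / (z.std(0) + eps)                  # (B, d1), standardised
    r = (r - r.mean(0)) / (r.std(0) + eps)                  # (B, d2), standardised
    c = z.t() @ r / z.size(0)                               # (d1, d2) cross-correlation matrix
    return (c ** 2).mean()

z_aug = torch.randn(32, 128, requires_grad=True)            # embedding of the augmented view
descriptor = torch.randn(32, 64)                            # e.g. a hand-crafted descriptor of the image
loss = -cross_correlation_dependence(z_aug, descriptor)     # minimise the negative = maximise dependence
loss.backward()
```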
https://arxiv.org/abs/2303.12130
The CLIP model has recently been proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), which in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: this https URL.
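At evaluation time, a contrastive captioning metric of this kind reduces to a scaled cosine similarity between image and caption embeddings, as in the sketch below; the encoders are random stand-ins, the scaling factor is illustrative, and the positive-augmented training of the embedding space is not reproduced.

```python
# Sketch of a CLIP-style, reference-free caption score.
import torch
import torch.nn.functional as F

def clip_style_score(image_emb, text_emb, w=2.0):
    """Reference-free score: w * max(0, cos(image, caption))."""
    sim = F.cosine_similarity(image_emb, text_emb, dim=-1)
    return w * sim.clamp(min=0)

image_emb = torch.randn(4, 512)        # stand-in for a visual encoder output
caption_emb = torch.randn(4, 512)      # stand-in for a text encoder output of candidate captions
scores = clip_style_score(image_emb, caption_emb)
print(scores)                          # one score per image-caption pair
```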
https://arxiv.org/abs/2303.12112
Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and minimizing a reconstruction loss. Alternatively, contrastive self-supervised methods encourage two versions of the same input to have similar representations, while pulling apart the representations of different inputs. We propose ViC-MAE, a general method that combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames. We show that visual representations learned under ViC-MAE generalize well to both video classification and image classification tasks. Using a ViT-B/16 backbone pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1K, improving absolute top-1 accuracy by 1.58% over a recent previous work. Moreover, our method maintains a competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives such as VicReg and SiamSiam.
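The contrastive branch can be pictured as follows: patch-level tokens (as an MAE encoder would produce for visible patches) are pooled into one global vector per frame, and two frames of the same video are treated as a positive pair. The projection head, token counts, and temperature are assumptions; the reconstruction branch is omitted.

```python
# Sketch of pooling patch features into a global representation and contrasting frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(768, 256)                                   # projection head (illustrative size)

def global_rep(patch_tokens):
    return proj(patch_tokens.mean(dim=1))                    # mean-pool visible patch tokens

def frame_contrastive(z_a, z_b, temperature=0.1):
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # same-video pairs on the diagonal
    targets = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

tokens_t0 = torch.randn(8, 49, 768)                          # patch tokens of frame t
tokens_t1 = torch.randn(8, 49, 768)                          # patch tokens of a later frame
loss = frame_contrastive(global_rep(tokens_t0), global_rep(tokens_t1))
```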
https://arxiv.org/abs/2303.12001
Mobile Edge Caching (MEC) integrated with Deep Neural Networks (DNNs) is an innovative technology with significant potential for the future generation of wireless networks, resulting in a considerable reduction in users' latency. The MEC network's effectiveness, however, heavily relies on its capacity to predict and dynamically update the storage of caching nodes with the most popular contents. To be effective, a DNN-based popularity prediction model needs to be able to understand the historical request patterns of content, including their temporal and spatial correlations. Existing state-of-the-art time-series DNN models capture the latter by simultaneously inputting the sequential request patterns of multiple contents to the network, considerably increasing the size of the input sample. This motivates us to address this challenge by proposing a DNN-based popularity prediction framework based on the idea of contrasting input samples against each other, designed for Unmanned Aerial Vehicle (UAV)-aided MEC networks. Referred to as Contrastive Learning-based Survival Analysis (CLSA), the proposed architecture consists of a self-supervised Contrastive Learning (CL) model, where the temporal information of sequential requests is learned using a Long Short-Term Memory (LSTM) network as the encoder of the CL architecture. This is followed by a Survival Analysis (SA) network, and the output of the proposed CLSA architecture is a probability of future popularity for each content; these probabilities are then sorted in descending order to identify the Top-K popular contents. Based on the simulation results, the proposed CLSA architecture outperforms its counterparts in terms of both classification accuracy and cache-hit ratio.
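A much-simplified sketch of the encoder side is shown below: an LSTM embeds each content's request history, a contrastive loss compares two augmented histories, and a small scoring head stands in for the survival-analysis network to rank the Top-K contents. All modules, shapes, and the augmentation are illustrative.

```python
# Sketch of an LSTM-based contrastive encoder for request histories with a scoring head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RequestEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=dim, batch_first=True)
        self.head = nn.Linear(dim, 1)                        # stand-in for the SA network

    def forward(self, requests):                             # (B, T, 1) request counts per time slot
        _, (h, _) = self.lstm(requests)
        z = h[-1]                                            # (B, dim) sequence embedding
        return z, self.head(z).squeeze(-1)                   # embedding and popularity score

def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

enc = RequestEncoder()
hist = torch.rand(32, 48, 1)                                 # 48 time slots per content
hist_aug = hist + 0.05 * torch.randn_like(hist)              # a simple augmentation
(z1, _), (z2, scores) = enc(hist), enc(hist_aug)
cl_loss = info_nce(z1, z2)
top_k = scores.topk(5).indices                               # predicted Top-K popular contents
```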
https://arxiv.org/abs/2303.12097
We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs (NeRFs that generate 3D objects given an input latent code). Recent works such as DreamFusion and Magic3D have shown great success in generating 3D content using NeRFs and text prompts, but the current approach of optimizing a NeRF for every text prompt 1) is extremely time-consuming and 2) often leads to low-resolution outputs. To address these challenges, we propose a novel method named 3D-CLFusion, which leverages pre-trained latent-based NeRFs and performs fast 3D content creation in less than a minute. In particular, we introduce a latent diffusion prior network for learning the w latent from the input CLIP text/image embeddings. This pipeline allows us to produce the w latent without further optimization during inference, and the pre-trained NeRF is able to perform multi-view high-resolution 3D synthesis based on the latent. We note that the novelty of our model lies in introducing contrastive learning during the training of the diffusion prior, which enables the generation of valid view-invariant latent codes. We demonstrate through experiments the effectiveness of our proposed view-invariant diffusion process for fast text-to-3D creation, e.g., 100 times faster than DreamFusion. We note that our model is able to serve as a plug-and-play tool for text-to-3D with pre-trained NeRFs.
https://arxiv.org/abs/2303.11938
Cross-View Geo-Localisation is still a challenging task in which additional modules, specific pre-processing, or zooming strategies are necessary to determine accurate positions of images. Since different views have different geometries, pre-processing such as polar transformation helps to merge them. However, this results in distorted images which then have to be rectified. Adding hard negatives to the training batch could improve the overall performance, but with the default loss functions in geo-localisation it is difficult to include them. In this article, we present a simplified but effective architecture based on contrastive learning with a symmetric InfoNCE loss that outperforms current state-of-the-art results. Our framework consists of a narrow training pipeline that eliminates the need for aggregation modules, avoids further pre-processing steps, and even increases the generalisation capability of the model to unknown regions. We introduce two types of sampling strategies for hard negatives. The first explicitly exploits geographically neighboring locations to provide a good starting point. The second leverages the visual similarity between the image embeddings in order to mine hard negative samples. Our work shows excellent performance on common cross-view datasets such as CVUSA, CVACT, University-1652, and VIGOR. A comparison between cross-area and same-area settings demonstrates the good generalisation capability of our model.
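The second mining strategy and the symmetric InfoNCE objective can be sketched as follows; the encoders are omitted, the embeddings are random stand-ins, and in practice the true match would be excluded from the mined negatives.

```python
# Sketch of similarity-based hard-negative mining and a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def symmetric_infonce(ground, sat, temperature=0.07):
    g, s = F.normalize(ground, dim=-1), F.normalize(sat, dim=-1)
    logits = g @ s.t() / temperature                          # matching pairs on the diagonal
    targets = torch.arange(g.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def mine_hard_negatives(query_embs, pool_embs, k=8):
    """Indices of the k most similar (hence hardest) pool entries per query;
    in practice the true match is removed from this set."""
    sim = F.normalize(query_embs, dim=-1) @ F.normalize(pool_embs, dim=-1).t()
    return sim.topk(k, dim=1).indices

ground = torch.randn(32, 512)                                 # street-view embeddings
sat_pool = torch.randn(10_000, 512)                           # candidate aerial-image embeddings
hard_idx = mine_hard_negatives(ground, sat_pool)              # used to compose the next batch
loss = symmetric_infonce(ground, sat_pool[:32])               # a batch with matched pairs
```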
https://arxiv.org/abs/2303.11851
Lack of audio-video synchronization is a common problem during television broadcasts and video conferencing, leading to an unsatisfactory viewing experience. A widely accepted paradigm is to create an error detection mechanism that identifies the cases when audio is leading or lagging. We propose ModEFormer, which independently extracts audio and video embeddings using modality-specific transformers. Different from other transformer-based approaches, ModEFormer preserves the modality of the input streams, which allows us to use a larger batch size with more negative audio samples for contrastive learning. Further, we propose a trade-off between the number of negative samples and the number of unique samples in a batch to significantly exceed the performance of previous methods. Experimental results show that ModEFormer achieves state-of-the-art performance, 94.5% for LRS2 and 90.9% for LRS3. Finally, we demonstrate how ModEFormer can be used for offset detection on test clips.
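Offset detection with modality-specific embeddings can be sketched as a sliding comparison: shift the audio embeddings against the video embeddings and keep the shift with the highest average cosine similarity. The embeddings below are random stand-ins for the transformer outputs, and the shift range is illustrative.

```python
# Sketch of audio-video offset detection by sliding one embedding sequence over the other.
import torch
import torch.nn.functional as F

def detect_offset(video_embs, audio_embs, max_shift=15):
    """video_embs, audio_embs: (T, d) per-clip embeddings on a shared timeline."""
    best_shift, best_sim = 0, float('-inf')
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            v, a = video_embs[shift:], audio_embs[:len(audio_embs) - shift]
        else:
            v, a = video_embs[:shift], audio_embs[-shift:]
        sim = F.cosine_similarity(v, a, dim=-1).mean().item()
        if sim > best_sim:
            best_shift, best_sim = shift, sim
    return best_shift, best_sim

video_embs = torch.randn(200, 256)                            # stand-in video-transformer embeddings
audio_embs = torch.randn(200, 256)                            # stand-in audio-transformer embeddings
offset, confidence = detect_offset(video_embs, audio_embs)
print(f"estimated offset: {offset} clips (similarity {confidence:.3f})")
```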
https://arxiv.org/abs/2303.11551