In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor this ability to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that surpasses all prior art by a large margin (24.8%) - a strong testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains, in the region of 26.9%, over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
https://arxiv.org/abs/2303.13440
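To make the patch shuffling idea concrete, here is a minimal, hypothetical sketch: split a sketch and its paired photo into a fixed grid of patches and apply the same random permutation to both before feeding them to the retrieval model, so that matching them encourages patch-level structural correspondence. The grid size and the shared-permutation scheme are assumptions; the abstract does not spell out the exact recipe.

```python
import torch

def paired_patch_shuffle(sketch, photo, grid=4, generator=None):
    """Apply the same random patch permutation to a sketch-photo pair.

    sketch, photo: tensors of shape (C, H, W) with H and W divisible by `grid`.
    Sharing one permutation across the pair is what encourages instance-level
    structural correspondence during training (illustrative sketch only).
    """
    c, h, w = sketch.shape
    ph, pw = h // grid, w // grid
    perm = torch.randperm(grid * grid, generator=generator)

    def shuffle(img):
        # (C, H, W) -> (grid*grid, C, ph, pw)
        patches = img.unfold(1, ph, ph).unfold(2, pw, pw)            # C, g, g, ph, pw
        patches = patches.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, ph, pw)
        patches = patches[perm]                                       # shared permutation
        # reassemble back to (C, H, W)
        patches = patches.reshape(grid, grid, c, ph, pw).permute(2, 0, 3, 1, 4)
        return patches.reshape(c, h, w)

    return shuffle(sketch), shuffle(photo)

# usage: sk, ph = paired_patch_shuffle(torch.rand(3, 224, 224), torch.rand(3, 224, 224))
```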
Deep hashing has been extensively applied to massive image retrieval due to its efficiency and effectiveness. Recently, several adversarial attacks have been presented to reveal the vulnerability of deep hashing models against adversarial examples. However, existing attack methods suffer from degraded performance or inefficiency because they underutilize the semantic relations between original samples or spend a lot of time learning these relations with a deep neural network. In this paper, we propose a novel Pharos-guided Attack, dubbed PgA, to evaluate the adversarial robustness of deep hashing networks reliably and efficiently. Specifically, we design a pharos code to represent the semantics of the benign image, which preserves the similarity to semantically relevant samples and dissimilarity to irrelevant ones. We prove that the pharos code can be calculated quickly via a simple mathematical formula. Accordingly, PgA can directly conduct a reliable and efficient attack on deep hashing-based retrieval by maximizing the similarity between the hash code of the adversarial example and the pharos code. Extensive experiments on the benchmark datasets verify that the proposed algorithm outperforms prior state-of-the-art methods in both attack strength and speed.
https://arxiv.org/abs/2303.12658
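The core attack objective (maximise the similarity between the adversarial example's hash code and the pharos code) can be sketched as a standard PGD-style loop. This is an illustration under assumptions, not PgA itself: the pharos code is taken as a given target vector rather than computed from the paper's formula, and the tanh relaxation of the sign function is a common convention.

```python
import torch
import torch.nn.functional as F

def targeted_hash_attack(model, image, target_code, eps=8 / 255, alpha=1 / 255, steps=40):
    """PGD-style attack: maximise cosine similarity between the (tanh-relaxed)
    hash output of the adversarial image and a fixed target code in {-1, +1}^K.

    model:       maps an image batch to K real-valued hash logits.
    image:       tensor in [0, 1], shape (1, C, H, W).
    target_code: tensor of shape (K,) with entries in {-1, +1} (a pharos-like code).
    """
    adv = image.clone().detach()
    target = target_code.float().unsqueeze(0)
    for _ in range(steps):
        adv.requires_grad_(True)
        logits = torch.tanh(model(adv))                   # relaxed hash code in (-1, 1)
        loss = F.cosine_similarity(logits, target).mean()
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # gradient ascent on similarity
            adv = image + (adv - image).clamp(-eps, eps)  # stay inside the L_inf budget
            adv = adv.clamp(0, 1)
    return adv.detach()
```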
This paper proposes a novel diffusion-based model, CompoDiff, for solving Composed Image Retrieval (CIR) with latent diffusion and presents a newly created dataset of 18 million triplets of reference images, conditions, and corresponding target images to train the model. CompoDiff not only achieves a new zero-shot state-of-the-art on a CIR benchmark such as FashionIQ but also enables a more versatile CIR by accepting various conditions, such as negative text and image mask conditions, which are unavailable with existing CIR methods. In addition, the CompoDiff features are on the intact CLIP embedding space so that they can be directly used for all existing models exploiting the CLIP space. The code, the dataset used for training, and the pre-trained weights are available at this https URL
https://arxiv.org/abs/2303.11916
Medical imaging analysis plays a critical role in the diagnosis and treatment of various medical conditions. This paper focuses on chest X-ray images and their corresponding radiological reports. It presents a new model that learns a joint X-ray image & report representation. The model is based on a novel alignment scheme between the visual data and the text, which takes into account both local and global information. Furthermore, the model integrates domain-specific information of two types -- lateral images and the consistent visual structure of chest images. Our representation is shown to benefit three types of retrieval tasks: text-image retrieval, class-based retrieval, and phrase-grounding.
https://arxiv.org/abs/2303.11755
Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, among them showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to a (generated) image-to-image retrieval task, surpassing the state of the art. We put forward generated results in the supplementary for everyone to scrutinise.
https://arxiv.org/abs/2303.11162
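At inference time the decoupled pipeline reduces to two calls: map the sketch into the generator's latent space, then decode. The module names below (sketch_mapper, stylegan_generator) are placeholders for illustration; the paper's mapper is autoregressive, which is not modelled here.

```python
import torch

@torch.no_grad()
def picture_a_sketch(sketch, sketch_mapper, stylegan_generator):
    """Decoupled sketch-to-photo inference (illustrative only).

    sketch:             (1, C, H, W) abstract free-hand sketch.
    sketch_mapper:      callable mapping a sketch to StyleGAN latent codes
                        (trained on sketch-photo pairs; autoregressive in the paper).
    stylegan_generator: StyleGAN decoder pre-trained on photos only, so any latent
                        it receives is decoded into a photorealistic image.
    """
    w_latent = sketch_mapper(sketch)         # sketch -> latent code
    photo = stylegan_generator(w_latent)     # latent -> photorealistic image
    return photo
```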
Multiple imaging modalities are often used for disease diagnosis, prediction, or population-based analyses. However, not all modalities might be available due to cost, different study designs, or changes in imaging technology. If the differences between the types of imaging are small, data harmonization approaches can be used; for larger changes, direct image synthesis approaches have been explored. In this paper, we develop an approach based on multi-modal metric learning to synthesize images of diverse modalities. We use metric learning via multi-modal image retrieval, resulting in embeddings that can relate images of different modalities. Given a large image database, the learned image embeddings allow us to use k-nearest neighbor (k-NN) regression for image synthesis. Our driving medical problem is knee osteoarthritis (KOA), but our developed method is general after proper image alignment. We test our approach by synthesizing cartilage thickness maps obtained from 3D magnetic resonance (MR) images using 2D radiographs. Our experiments show that the proposed method outperforms direct image synthesis and that the synthesized thickness maps retain information relevant to downstream tasks such as progression prediction and Kellgren-Lawrence grading (KLG). Our results suggest that retrieval approaches can be used to obtain high-quality and meaningful image synthesis results given large image databases.
https://arxiv.org/abs/2303.10249
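The retrieval-based synthesis step is straightforward to sketch: embed the query radiograph with the metric-learned encoder, look up its k nearest neighbours in the database of embeddings from the other modality, and combine their cartilage thickness maps. The unweighted average below is an assumption; a similarity-weighted average would be a natural variant.

```python
import numpy as np

def knn_synthesis(query_embedding, db_embeddings, db_thickness_maps, k=5):
    """Synthesise a thickness map for a query by k-NN regression in embedding space.

    query_embedding:   (D,) embedding of the 2D radiograph.
    db_embeddings:     (N, D) embeddings of the database images (other modality).
    db_thickness_maps: (N, H, W) thickness maps paired with the database images.
    """
    # cosine similarity in the shared, metric-learned embedding space
    q = query_embedding / np.linalg.norm(query_embedding)
    db = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    sims = db @ q
    nn_idx = np.argsort(-sims)[:k]
    # simple unweighted average of the neighbours' maps
    return db_thickness_maps[nn_idx].mean(axis=0)
```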
While generative modeling has been ubiquitous in natural language processing and computer vision, its application to image retrieval remains unexplored. In this paper, we recast image retrieval as a form of generative modeling by employing a sequence-to-sequence model, contributing to the current unified theme. Our framework, IRGen, is a unified model that enables end-to-end differentiable search, thus achieving superior performance thanks to direct optimization. While developing IRGen we tackle the key technical challenge of converting an image into quite a short sequence of semantic units in order to enable efficient and effective retrieval. Empirical experiments demonstrate that our model yields significant improvement over three commonly used benchmarks, for example, 22.9% higher precision@10 than the best baseline method on the In-shop dataset with a comparable recall@10 score.
https://arxiv.org/abs/2303.10126
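The abstract does not specify how an image becomes "a short sequence of semantic units", so the snippet below only illustrates one plausible choice, residual quantisation against a stack of codebooks, to show what such a tokenizer could look like; it is not IRGen's actual tokenizer.

```python
import torch

def embed_to_semantic_tokens(embedding, codebooks):
    """Residual-quantise an image embedding into a short token sequence (illustrative).

    embedding: (D,) image feature.
    codebooks: list of (K, D) tensors, one per output token.
    Returns a list of token ids whose length equals the number of codebooks.
    """
    residual = embedding.clone()
    tokens = []
    for codebook in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)  # (K,)
        idx = int(dists.argmin())
        tokens.append(idx)
        residual = residual - codebook[idx]   # quantise what the previous tokens missed
    return tokens

# usage: embed_to_semantic_tokens(torch.rand(512), [torch.rand(256, 512) for _ in range(4)])
```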
We study the task of Composed Image Retrieval (CoIR), where a query is composed of two modalities, image and text, extending the user's expression ability. Previous methods typically address this task by a separate encoding of each query modality, followed by late fusion of the extracted features. In this paper, we propose a new approach, Cross-Attention driven Shift Encoder (CASE), employing early fusion between modalities through a cross-attention module with an additional auxiliary task. We show that our method outperforms the existing state-of-the-art on established benchmarks (FashionIQ and CIRR) by a large margin. However, CoIR datasets are a few orders of magnitude smaller compared to other vision and language (V&L) datasets, and some suffer from serious flaws (e.g., queries with a redundant modality). We address these shortcomings by introducing Large Scale Composed Image Retrieval (LaSCo), a new CoIR dataset ten times larger than current ones. Pre-training on LaSCo yields a further performance boost. We further suggest a new analysis of CoIR datasets and methods for detecting modality redundancy or necessity in queries.
https://arxiv.org/abs/2303.09429
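A minimal early-fusion module in the spirit of the description above: text tokens attend to image tokens through cross-attention before any late pooling. Dimensions, the attention direction, and the mean pooling are assumptions rather than CASE's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Early fusion: text tokens query image tokens via cross-attention."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, T, D), image_tokens: (B, P, D)
        fused, _ = self.cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
        fused = self.norm(text_tokens + fused)   # residual + norm
        return fused.mean(dim=1)                 # pooled composed-query embedding

# usage: CrossAttentionFusion()(torch.rand(2, 16, 512), torch.rand(2, 50, 512))
```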
Previous knowledge-distillation-based efficient image retrieval methods employ a lightweight network as the student model for fast inference. However, the lightweight student model lacks adequate representation capacity for effective knowledge imitation during the most critical early training period, causing final performance degeneration. To tackle this issue, we propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity. Specifically, the employed student model is initially a heavy model to fruitfully learn distilled knowledge in the early training epochs, and the student model is gradually compressed during training. To dynamically adjust the model capacity, our dynamic framework inserts a learnable convolutional layer within each residual block of the student model as a channel importance indicator. The indicator is optimized simultaneously by the image retrieval loss and the compression loss, and a retrieval-guided gradient resetting mechanism is proposed to resolve the gradient conflict. Extensive experiments show that our method has superior inference speed and accuracy; e.g., on the VeRi-776 dataset, with ResNet101 as the teacher, our method saves 67.13% of model parameters and 65.67% of FLOPs (around 24.13% and 21.94% more than the state of the art) without sacrificing accuracy (around 2.11% higher mAP than the state of the art).
https://arxiv.org/abs/2303.09230
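The channel importance indicator can be pictured as a learnable depthwise 1x1 convolution appended to a residual block, whose weights double as per-channel pruning scores. The L1 term below stands in for the compression loss; the paper's exact gating, compression schedule, and gradient-resetting mechanism are not reproduced here.

```python
import torch
import torch.nn as nn

class ChannelIndicator(nn.Module):
    """Learnable depthwise 1x1 conv acting as a per-channel importance indicator.

    Channels whose indicator weights shrink toward zero are candidates for
    pruning as the student is gradually compressed during training.
    """

    def __init__(self, channels):
        super().__init__()
        # depthwise 1x1 conv == one learnable scale per channel
        self.gate = nn.Conv2d(channels, channels, kernel_size=1,
                              groups=channels, bias=False)
        nn.init.ones_(self.gate.weight)

    def forward(self, x):
        return self.gate(x)

    def compression_loss(self):
        # L1 sparsity on the indicator weights (stand-in for the paper's compression loss)
        return self.gate.weight.abs().mean()

# usage: y = ChannelIndicator(256)(torch.rand(1, 256, 14, 14))
```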
This paper investigates unsupervised representation learning for facial expression analysis. We think Unsupervised Facial Expression Representation (UFER) deserves exploration and has the potential to address some key challenges in facial expression analysis, such as scaling, annotation bias, the discrepancy between discrete labels and continuous emotions, and model pre-training. Thus motivated, we propose a UFER method with contrastive local warping (ContraWarping), which leverages the insight that emotional expression is robust to common global transformations (affine transformation, color jitter, etc.) but can be easily changed by random local warping. Therefore, given a facial image, ContraWarping employs some global transformations and local warping to generate its positive and negative samples and sets up a novel contrastive learning framework. Our in-depth investigation shows that: 1) the positive pairs from global transformations may be exploited with general self-supervised learning (e.g., BYOL) and already bring some informative features, and 2) the negative pairs from local warping explicitly introduce expression-related variation and further bring substantial improvement. Based on ContraWarping, we demonstrate the benefit of UFER under two facial expression analysis scenarios: facial expression recognition and image retrieval. For example, directly using ContraWarping features for linear probing achieves 79.14% accuracy on RAF-DB, significantly reducing the gap towards the fully supervised counterpart (88.92% / 84.81% with/without pre-training).
https://arxiv.org/abs/2303.09034
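A rough sketch of the data side of ContraWarping: a coarse random displacement field provides the local warping that produces negatives, while ordinary global transformations (colour jitter, affine) would produce positives; a simple margin loss stands in for the paper's contrastive framework. Field resolution, strength, and the loss form are all assumptions.

```python
import torch
import torch.nn.functional as F

def random_local_warp(img, strength=0.1, grid_size=8):
    """Randomly warp an image locally (illustrative stand-in for the paper's warping).

    img: (B, C, H, W) float tensor. A coarse random displacement field is upsampled
    and applied with grid_sample, changing local geometry (and thus the apparent
    expression) while keeping global appearance.
    """
    b, _, h, w = img.shape
    # identity sampling grid in [-1, 1]
    theta = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]).repeat(b, 1, 1)
    grid = F.affine_grid(theta, size=img.shape, align_corners=False)     # (B, H, W, 2)
    # coarse random offsets, smoothly upsampled to full resolution
    coarse = (torch.rand(b, 2, grid_size, grid_size) - 0.5) * 2 * strength
    offsets = F.interpolate(coarse, size=(h, w), mode='bilinear', align_corners=False)
    return F.grid_sample(img, grid + offsets.permute(0, 2, 3, 1), align_corners=False)

def contrawarping_style_loss(anchor_z, positive_z, negative_z, margin=0.2):
    """Pull the globally-transformed view together, push the locally-warped view away."""
    pos = F.cosine_similarity(anchor_z, positive_z)
    neg = F.cosine_similarity(anchor_z, negative_z)
    return F.relu(neg - pos + margin).mean()
```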
Content-based image retrieval is the process of retrieving a subset of images from an extensive image gallery based on visual contents, such as color, shape or spatial relations, and texture. In some applications, such as localization, image retrieval is employed as the initial step. In such cases, the accuracy of the top-retrieved images significantly affects the overall system accuracy. The current paper introduces a simple yet efficient image retrieval system with fewer trainable parameters, which offers acceptable accuracy in top-retrieved images. The proposed method benefits from a dilated residual convolutional neural network with triplet loss. Experimental evaluations show that this model can extract richer information (i.e., high-resolution representations) by enlarging the receptive field, thus improving image retrieval accuracy without increasing the depth or complexity of the model. To enhance the robustness of the extracted representations, the current research obtains candidate regions of interest from each feature map and applies Generalized-Mean pooling to the regions. As the choice of triplets in a triplet-based network affects the model training, we employ an online triplet mining method. We test the performance of the proposed method under various configurations on two challenging image-retrieval datasets, namely Revisited Paris6k (RPar) and UKBench. The experimental results show an accuracy of 94.54 and 80.23 (mean precision at rank 10) in the RPar medium and hard modes and 3.86 (recall at rank 4) on the UKBench dataset, respectively.
https://arxiv.org/abs/2303.08398
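Generalized-Mean (GeM) pooling, used above on the candidate regions, has a compact standard form; here is a minimal implementation, independent of the paper's region-selection step.

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized-Mean pooling: (mean(x^p))^(1/p) over the spatial dimensions.

    p = 1 recovers average pooling; p -> infinity approaches max pooling.
    The exponent is learnable, as is common practice.
    """

    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                                    # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(-2, -1)).pow(1.0 / self.p)        # (B, C)

# usage: GeM()(torch.rand(2, 2048, 7, 7)).shape == (2, 2048)
```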
Rising concerns about privacy and anonymity preservation of deep learning models have facilitated research in data-free learning (DFL). For the first time, we identify that for data-scarce tasks like Sketch-Based Image Retrieval (SBIR), where the difficulty in acquiring paired photos and hand-drawn sketches limits data-dependent cross-modal learning algorithms, DFL can prove to be a much more practical paradigm. We thus propose Data-Free (DF)-SBIR, where, unlike existing DFL problems, pre-trained, single-modality classification models have to be leveraged to learn a cross-modal metric-space for retrieval without access to any training data. The widespread availability of pre-trained classification models, along with the difficulty in acquiring paired photo-sketch datasets for SBIR, justifies the practicality of this setting. We present a methodology for DF-SBIR, which can leverage knowledge from models independently trained to perform classification on photos and sketches. We evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks, designing a variety of baselines based on state-of-the-art DFL literature, and observe that our method surpasses all of them by significant margins. Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data. Implementation is available at this https URL.
https://arxiv.org/abs/2303.07775
2D image understanding is a complex problem within Computer Vision, but it holds the key to providing human level scene comprehension. It goes further than identifying the objects in an image, and instead it attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, Visual Question Answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus in recent years Graph Neural Networks (GNNs) have become a standard component of many 2D image understanding pipelines, becoming a core architectural component especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and we provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of future potential developments. To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of their architecture.
https://arxiv.org/abs/2303.03761
The amount of medical images stored in hospitals is increasing faster than ever; however, utilizing the accumulated medical images has been limited. This is because existing content-based medical image retrieval (CBMIR) systems usually require example images to construct query vectors; nevertheless, example images cannot always be prepared. Besides, there can be images with rare characteristics that make it difficult to find similar example images, which we call isolated samples. Here, we introduce a novel sketch-based medical image retrieval (SBMIR) system that enables users to find images of interest without example images. The key idea lies in feature decomposition of medical images, whereby the entire feature of a medical image can be decomposed into and reconstructed from normal and abnormal features. By extending this idea, our SBMIR system provides an easy-to-use two-step graphical user interface: users first select a template image to specify a normal feature and then draw a semantic sketch of the disease on the template image to represent an abnormal feature. Subsequently, it integrates the two kinds of input to construct a query vector and retrieves reference images with the closest reference vectors. Using two datasets, ten healthcare professionals with various clinical backgrounds participated in the user test for evaluation. As a result, our SBMIR system enabled users to overcome previous challenges, including image retrieval based on fine-grained image characteristics, image retrieval without example images, and image retrieval for isolated samples. Our SBMIR system achieves flexible medical image retrieval on demand, thereby expanding the utility of medical image databases.
https://arxiv.org/abs/2303.03633
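The two-step query construction can be sketched as follows, with the combination-by-addition and the encoder names being assumptions for illustration: a normal feature comes from the chosen template image, an abnormal feature from the drawn sketch, and their combination is matched against pre-computed reference vectors.

```python
import numpy as np

def build_sbmir_query(normal_encoder, abnormal_encoder, template_image, sketch,
                      reference_vectors, top_k=10):
    """Two-step sketch-based query (illustrative; combining by addition is an assumption).

    normal_encoder / abnormal_encoder: callables returning (D,) feature vectors for
        the normal anatomy (template) and the sketched abnormality, respectively.
    reference_vectors: (N, D) pre-computed vectors of the image database.
    """
    query = normal_encoder(template_image) + abnormal_encoder(sketch)
    query = query / np.linalg.norm(query)
    refs = reference_vectors / np.linalg.norm(reference_vectors, axis=1, keepdims=True)
    scores = refs @ query
    return np.argsort(-scores)[:top_k]   # indices of the closest reference images
```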
Image retrieval has garnered growing interest in recent times. The current approaches are either supervised or self-supervised. These methods do not exploit the benefits of hybrid learning using both supervision and self-supervision. We present a novel Master Assistant Buddy Network (MABNet) for image retrieval which incorporates both learning mechanisms. MABNet consists of master and assistant blocks, both learning independently through supervision and collectively via self-supervision. The master guides the assistant by providing its knowledge base as a reference for self-supervision and the assistant reports its knowledge back to the master by weight transfer. We perform extensive experiments on public datasets with and without post-processing.
https://arxiv.org/abs/2303.03050
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in each individual input/output format and dataset size. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and inability to exploit inter-task relatedness. To address such issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model for multiple heterogeneous fashion tasks, therefore being much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models. Code is available at this https URL.
https://arxiv.org/abs/2303.02483
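The parameter-efficiency argument rests on lightweight adapters inside a shared, largely frozen V+L backbone. Below is a generic bottleneck adapter of the kind commonly used for this purpose; FAME-ViL's task-versatile and cross-attention adapters are more elaborate, so treat this only as a sketch of the mechanism.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Standard bottleneck adapter: down-project, non-linearity, up-project, residual.

    Inserting small task-specific adapters into a frozen V+L backbone is what makes
    a single shared model parameter-efficient across heterogeneous tasks.
    """

    def __init__(self, dim=768, reduction=4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (B, T, D) token features
        return x + self.up(self.act(self.down(x)))

# usage: one adapter per task, inserted after each (frozen) transformer layer
```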
Accessing and understanding contemporary and historical events of global impact such as the US elections and the Olympic Games is a major prerequisite for cross-lingual event analytics that investigate event causes, perception and consequences across country borders. In this paper, we present the Open Event Knowledge Graph (OEKG), a multilingual, event-centric, temporal knowledge graph composed of seven different data sets from multiple application domains, including question answering, entity recommendation and named entity recognition. These data sets are all integrated through an easy-to-use and robust pipeline and by linking to the event-centric knowledge graph EventKG. We describe their common schema and demonstrate the use of the OEKG at the example of three use cases: type-specific image retrieval, hybrid question answering over knowledge graphs and news articles, as well as language-specific event recommendation. The OEKG and its query endpoint are publicly available.
https://arxiv.org/abs/2302.14688
The gap between low-level visual signals and high-level semantics has been progressively bridged by the continuous development of deep neural networks (DNNs). With recent progress in DNNs, almost all image classification tasks have achieved new records of accuracy. To extend the ability of DNNs to image retrieval tasks, we propose a unified DNN model for image-query similarity calculation by simultaneously modeling the image and the query in one network. The unified DNN is named the cross space mapping (CSM) model, which contains two parts, a convolutional part and a query-embedding part. The image and query are mapped to a common vector space via these two parts respectively, and image-query similarity is naturally defined as the inner product of their mappings in that space. To ensure good generalization ability of the DNN, we learn its weights from a large number of click-through logs consisting of 23 million clicked image-query pairs between 1 million images and 11.7 million queries. Both the qualitative and quantitative results on an image retrieval evaluation task with 1000 queries demonstrate the superiority of the proposed method.
https://arxiv.org/abs/2302.13275
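The CSM similarity is simply an inner product between two learned mappings into a common space. The sketch below uses a placeholder image encoder and a bag-of-words query embedding (the latter is an assumption; the abstract only names a "query-embedding part").

```python
import torch
import torch.nn as nn

class CrossSpaceMapping(nn.Module):
    """Map images and text queries into one vector space; similarity = inner product."""

    def __init__(self, image_encoder, vocab_size, embed_dim=256):
        super().__init__()
        self.image_encoder = image_encoder                          # convolutional part
        self.query_embed = nn.EmbeddingBag(vocab_size, embed_dim)   # query-embedding part

    def forward(self, images, query_token_ids):
        img_vec = self.image_encoder(images)          # (B, D)
        qry_vec = self.query_embed(query_token_ids)   # (B, D), bag-of-words average
        return (img_vec * qry_vec).sum(dim=-1)        # (B,) inner-product similarity
```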
Recent advances in MRI have led to the creation of large datasets. With the increase in data volume, it has become difficult to locate previous scans of the same patient within these datasets (a process known as re-identification). To address this issue, we propose an AI-powered medical imaging retrieval framework called DeepBrainPrint, which is designed to retrieve brain MRI scans of the same patient. Our framework is a semi-self-supervised contrastive deep learning approach with three main innovations. First, we use a combination of self-supervised and supervised paradigms to create an effective brain fingerprint from MRI scans that can be used for real-time image retrieval. Second, we use a special weighting function to guide the training and improve model convergence. Third, we introduce new imaging transformations to improve retrieval robustness in the presence of intensity variations (i.e. different scan contrasts), and to account for age and disease progression in patients. We tested DeepBrainPrint on a large dataset of T1-weighted brain MRIs from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and on a synthetic dataset designed to evaluate retrieval performance with different image modalities. Our results show that DeepBrainPrint outperforms previous methods, including simple similarity metrics and more advanced contrastive deep learning frameworks.
https://arxiv.org/abs/2302.13057
Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent well-documented limitation - they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench" - a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
https://arxiv.org/abs/2302.12066
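Given embeddings from a CLIP-like model, the counting-contrastive idea can be written as a small classification between the correct caption and its counterfactual with a wrong count, used alongside the original objective during finetuning. Temperature, normalisation, and the two-way softmax form are assumptions.

```python
import torch
import torch.nn.functional as F

def counting_contrastive_loss(image_emb, true_caption_emb, counterfactual_emb,
                              temperature=0.07):
    """Counting-style contrastive loss over counterfactual hard negatives (illustrative).

    image_emb:          (B, D) embeddings of images, e.g. showing three dogs.
    true_caption_emb:   (B, D) embeddings of the correct captions ("Three dogs ...").
    counterfactual_emb: (B, D) embeddings of the same captions with a wrong count
                        ("Six dogs ..."), used as hard negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    pos = (image_emb * F.normalize(true_caption_emb, dim=-1)).sum(-1) / temperature
    neg = (image_emb * F.normalize(counterfactual_emb, dim=-1)).sum(-1) / temperature
    logits = torch.stack([pos, neg], dim=1)                   # (B, 2): correct vs. wrong count
    labels = torch.zeros(logits.size(0), dtype=torch.long)    # correct caption is index 0
    return F.cross_entropy(logits, labels)
```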