Deep quantization methods have shown high efficiency on large-scale image retrieval. However, current models heavily rely on ground-truth information, hindering the application of quantization in label-hungry scenarios. A more realistic demand is to learn from inexhaustible uploaded images that are associated with informal tags provided by amateur users. Though such sketchy tags do not obviously reveal the labels, they actually contain useful semantic information for supervising deep quantization. To this end, we propose Weakly-Supervised Deep Hyperspherical Quantization (WSDHQ), which is the first work to learn deep quantization from weakly tagged images. Specifically, 1) we use word embeddings to represent the tags and enhance their semantic information based on a tag correlation graph. 2) To better preserve semantic information in quantization codes and reduce quantization error, we jointly learn semantics-preserving embeddings and a supervised quantizer on the hypersphere by employing a well-designed fusion layer and tailor-made loss functions. Extensive experiments show that WSDHQ can achieve state-of-the-art performance on weakly-supervised compact coding. Code is available at this https URL.
https://arxiv.org/abs/2404.04998
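To make the hyperspherical quantization above concrete, here is a minimal numpy sketch, not the WSDHQ implementation: embeddings and codewords are normalized onto the unit hypersphere, assignment uses cosine similarity, and the quantization error is the mean cosine gap. All names and the toy data are hypothetical.

```python
import numpy as np

def hyperspherical_quantize(embeddings, codebook):
    """Assign each unit-norm embedding to its most similar unit-norm codeword.

    Illustrative sketch: cosine similarity replaces Euclidean distance as
    the quantization criterion, and the error is the mean cosine gap.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sim = e @ c.T                                  # (n, K) cosine similarities
    codes = sim.argmax(axis=1)                     # nearest codeword per item
    quant_error = (1.0 - sim[np.arange(len(e)), codes]).mean()
    return codes, quant_error

rng = np.random.default_rng(0)
codes, err = hyperspherical_quantize(rng.normal(size=(8, 16)),   # toy data
                                     rng.normal(size=(4, 16)))   # 4 codewords
```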
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also as a reasoning aggregation graph at each step, to cope with the multiple aspects of thinking that single-step reasoning overlooks. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
https://arxiv.org/abs/2404.04538
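A rough sketch of a single reasoning step reduced to prompt aggregation and prompt flow over plain vectors; the softmax scoring and the 0.5 mixing factor are assumptions for illustration, not the AGoT architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def agot_step(prev_prompt, aspect_prompts):
    """One aggregation-graph step (illustrative): score each aspect prompt
    against the prompt flowing in from the previous step, aggregate by the
    scores, then mix with the incoming prompt (prompt flow)."""
    weights = softmax(aspect_prompts @ prev_prompt)   # relevance per aspect
    aggregated = weights @ aspect_prompts             # prompt aggregation
    return 0.5 * prev_prompt + 0.5 * aggregated       # flow to the next step

rng = np.random.default_rng(1)
prompt = np.ones(8)
for _ in range(3):                                    # a 3-step chain of graphs
    prompt = agot_step(prompt, rng.normal(size=(4, 8)))
```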
How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean, the most popular training set, by identifying and removing class overlap with Revisited Oxford and Paris [34], the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods, our findings are striking. Not only is there a dramatic drop in performance, but it is inconsistent across methods, changing the ranking. What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect objects of interest and extract a global image representation. We outperform previous state-of-the-art on both existing training sets and the new RGLDv2-clean. Our dataset is available at this https URL.
https://arxiv.org/abs/2404.01524
In Visual Place Recognition (VPR) the pose of a query image is estimated by comparing the image to a map of reference images with known reference poses. As is typical for image retrieval problems, a feature extractor maps the query and reference images to a feature space, where a nearest neighbor search is then performed. However, until recently, little attention had been given to quantifying the confidence that a retrieved reference image is a correct match. Highly certain but incorrect retrieval can lead to catastrophic failure of VPR-based localization pipelines. This work compares for the first time the main approaches for estimating image-matching uncertainty, including the traditional retrieval-based uncertainty estimation, more recent data-driven aleatoric uncertainty estimation, and the compute-intensive geometric verification. We further formulate a simple baseline method, "SUE", which unlike the other methods considers the freely available poses of the reference images in the map. Our experiments reveal that a simple L2-distance between the query and reference descriptors is already a better estimate of image-matching uncertainty than current data-driven approaches. SUE outperforms the other efficient uncertainty estimation methods, and its uncertainty estimates complement the computationally expensive geometric verification approach. Future works for uncertainty estimation in VPR should consider the baselines discussed in this work.
https://arxiv.org/abs/2404.00546
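A minimal sketch of the intuition behind SUE: reference poses are freely available in the map, so the spatial spread of the top-k retrieved references can flag uncertain matches alongside the plain L2 distance. The choice of k and the spread statistic here are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sue_style_uncertainty(query_desc, ref_descs, ref_poses, k=5):
    """Pose-spread uncertainty for a VPR match (illustrative).

    Retrieve the top-k references by L2 distance; if their map poses are
    scattered, the match is uncertain, since descriptors that agree on a
    place should come from nearby poses.
    """
    d = np.linalg.norm(ref_descs - query_desc, axis=1)
    topk = np.argsort(d)[:k]
    poses = ref_poses[topk]                        # (k, 2) x/y map positions
    spread = np.linalg.norm(poses - poses.mean(0), axis=1).mean()
    return d[topk[0]], spread                      # L2 baseline + pose spread
```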
Open-vocabulary vision-language models (VLMs) like CLIP, trained using contrastive loss, have emerged as a promising new paradigm for text-to-image retrieval. However, do VLMs understand compound nouns (CNs) (e.g., lab coat) as well as they understand nouns (e.g., lab)? We curate Compun, a novel benchmark with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in interpreting CNs. The Compun benchmark challenges a VLM for text-to-image retrieval where, given a text prompt with a CN, the task is to select the correct image that shows the CN from among a pair of distractor images that show the constituent nouns that make up the CN. Next, we perform an in-depth analysis to highlight CLIP's limited understanding of certain types of CNs. Finally, we present an alternative framework that moves beyond the hand-written templates for text prompts widely used by CLIP-like models. We employ a Large Language Model to generate multiple diverse captions that include the CN as an object in the scene described by the caption. Our proposed method improves CN understanding of CLIP by 8.25% on Compun. Code and benchmark are available at: this https URL
https://arxiv.org/abs/2404.00419
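The caption-based strategy can be sketched over pre-computed CLIP-style embeddings: several LLM-generated captions that place the compound noun in a scene are averaged into one query and matched against the correct image and its two constituent-noun distractors. The averaging step and all names below are illustrative assumptions.

```python
import numpy as np

def retrieve_with_diverse_captions(caption_embs, image_embs):
    """Pick the best image for a compound noun (illustrative sketch).

    caption_embs: (m, d) embeddings of m LLM-generated captions;
    image_embs:  (3, d) embeddings of the correct image plus the two
    distractors showing the constituent nouns.
    """
    t = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    query = t.mean(axis=0)
    query /= np.linalg.norm(query)                 # average caption as query
    return int(np.argmax(v @ query))               # index of retrieved image
```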
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent work leverages text instructions to allow users to more freely express their search intents. However, existing work primarily focuses on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can make those implicit relations explicit by synthesizing instructions via large multimodal models (LMMs) and large language models (LLMs). Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves comparable or better results on eight benchmarks of various image retrieval tasks than prior state-of-the-art (SOTA) methods. Remarkably, it outperforms the previous SOTA on multiple benchmarks with a 50X smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens.
https://arxiv.org/abs/2403.19651
State-of-the-art (SOTA) hierarchical localisation pipelines (HLoc) rely on image retrieval (IR) techniques to establish 2D-3D correspondences by selecting the $k$ most similar images from a reference image database for a given query image. Although higher values of $k$ enhance localisation robustness, the computational cost for feature matching increases linearly with $k$. In this paper, we observe that queries that are the most similar to images in the database result in a higher proportion of feature matches and, thus, more accurate positioning. A small number of images is therefore sufficient for queries very similar to images in the reference database. We propose a novel approach, AIR-HLoc, which divides query images into different localisation difficulty levels based on their similarity to the reference image database: an image with high similarity to the reference images is an easy query, and an image with low similarity is a hard query. Easy queries show only a limited gain in accuracy as $k$ increases, whereas higher values of $k$ significantly improve accuracy for hard queries. AIR-HLoc therefore adapts the value of $k$ to the query's difficulty level, optimizing processing time by adaptively assigning different values of $k$ based on the similarity between the query and reference images, without losing accuracy. Our extensive experiments on the Cambridge Landmarks, 7Scenes, and Aachen Day-Night-v1.1 datasets demonstrate our algorithm's efficacy, reducing computational overhead by 30\%, 26\%, and 11\%, respectively, while maintaining SOTA accuracy compared to HLoc with fixed image retrieval.
https://arxiv.org/abs/2403.18281
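A minimal sketch of the adaptive-$k$ idea: the query's similarity to its nearest reference serves as the difficulty proxy, and $k$ is chosen accordingly. The two-level rule and the threshold below are hypothetical simplifications; the paper divides queries into several difficulty levels.

```python
import numpy as np

def adaptive_k(query_desc, ref_descs, k_easy=5, k_hard=50, tau=0.7):
    """Choose how many reference images to match (illustrative sketch).

    High similarity to the nearest reference marks an easy query, which
    gets a small k; low similarity marks a hard query, which gets a large
    k. tau, k_easy, and k_hard are hypothetical values.
    """
    sims = ref_descs @ query_desc / (
        np.linalg.norm(ref_descs, axis=1) * np.linalg.norm(query_desc))
    return k_easy if sims.max() >= tau else k_hard
```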
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description, without training on triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space. However, they focus on the global visual representation, ignoring the representation of detailed attributes, e.g., color, object number, and layout. To address this challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions, emphasizing shared attribute information in various aspects. In this way, KEDs recognizes the reference image from diverse perspectives. Moreover, KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts, leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space. Extensive experiments on widely used benchmarks, i.e., ImageNet-R, COCO object, Fashion-IQ, and CIRR, show that KEDs outperforms previous zero-shot composed image retrieval methods.
https://arxiv.org/abs/2403.16005
The burgeoning integration of 3D medical imaging into healthcare has led to a substantial increase in the workload of medical professionals. To assist clinicians in their diagnostic processes and alleviate their workload, the development of a robust system for retrieving similar case studies presents a viable solution. While the concept holds great promise, the field of 3D medical text-image retrieval is currently limited by the absence of robust evaluation benchmarks and curated datasets. To remedy this, our study presents a groundbreaking dataset, BIMCV-R (this dataset will be released upon acceptance), which includes an extensive collection of 8,069 3D CT volumes, encompassing over 2 million slices, paired with their respective radiological reports. Building upon this dataset, we craft a retrieval strategy, MedFinder. This approach employs a dual-stream network architecture, harnessing the potential of large language models to advance the field of medical image retrieval beyond existing text-image retrieval solutions. It marks our preliminary step towards developing a system capable of facilitating text-to-image, image-to-text, and keyword-based retrieval tasks.
https://arxiv.org/abs/2403.15992
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The text input is restricted to 77 tokens, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
https://arxiv.org/abs/2403.15378
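The knowledge-preserved stretching can be sketched in a few lines: the first, well-trained positions of the positional embedding are kept fixed and the remainder is linearly interpolated to a longer length. Keeping 20 positions echoes the abstract's observation about the effective length; the interpolation factor of 4 is an assumption for illustration, not the paper's code.

```python
import numpy as np

def stretch_positional_embedding(pos_emb, keep=20, factor=4):
    """Knowledge-preserved stretching (sketch of the idea).

    The first `keep` positions are left untouched; the remaining positions
    are linearly interpolated to `factor` times their original length,
    extending the usable text context.
    """
    head, tail = pos_emb[:keep], pos_emb[keep:]
    n, d = tail.shape
    xs = np.linspace(0, n - 1, num=n * factor)
    stretched = np.stack(
        [np.interp(xs, np.arange(n), tail[:, j]) for j in range(d)], axis=1)
    return np.concatenate([head, stretched], axis=0)

long_pe = stretch_positional_embedding(np.random.randn(77, 512))
print(long_pe.shape)   # (248, 512): 20 kept + 57 * 4 interpolated positions
```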
Image generators are rapidly gaining popularity and have changed how digital content is created. With the latest AI technology, millions of high-quality images are being generated by the public, constantly motivating the research community to push the limits of generative models to create more complex and realistic images. This paper focuses on Cross-Domain Image Retrieval (CDIR), which can be used as an additional tool to inspect collections of generated images by determining the level of similarity between images in a dataset. An ideal retrieval system would be able to generalize to unseen complex images from multiple domains (e.g., photos, drawings, and paintings). To address this goal, we propose a novel caption-matching approach that leverages multimodal language-vision architectures pre-trained on large datasets. The method is tested on the DomainNet and Office-Home datasets and consistently achieves state-of-the-art performance over the latest approaches in the literature for cross-domain image retrieval. To verify its effectiveness on AI-generated images, the method was also tested on a database composed of samples collected from Midjourney, a widely used generative platform for content creation.
https://arxiv.org/abs/2403.15152
Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.
https://arxiv.org/abs/2403.14977
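One way to picture the piecewise-linear model: approximate the manifold near an anchor point by the top principal directions of its neighborhood, and judge similarity by how well another point's offset fits inside that local linear piece. The dimension and the residual-based score below are assumptions, not the paper's estimator.

```python
import numpy as np

def local_linear_similarity(x, anchor, neighbors, dim=3):
    """Similarity of x to the linear piece around `anchor` (illustrative).

    The neighborhood of `anchor` defines a low-dimensional linear piece via
    its top `dim` principal directions; a point whose offset from the
    anchor lies mostly inside that piece is deemed similar.
    """
    centered = neighbors - neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:dim]                               # local linear piece
    offset = x - anchor
    inside = basis.T @ (basis @ offset)            # projection onto the piece
    residual = np.linalg.norm(offset - inside)
    return 1.0 / (1.0 + residual)                  # high when x fits the piece
```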
In analyzing vast amounts of digitally stored historical image data, existing content-based retrieval methods often overlook significant non-semantic information, limiting their effectiveness for flexible exploration across varied themes. To broaden the applicability of image retrieval methods for diverse purposes and uncover more general patterns, we innovatively introduce a crucial factor from computational aesthetics, namely image composition, into this topic. By explicitly integrating composition-related information extracted by a CNN into the designed retrieval model, our method considers both the image's composition rules and semantic information. Qualitative and quantitative experiments demonstrate that the image retrieval network guided by composition information outperforms those relying solely on content information, facilitating the identification of database images that are closer to the target image in human perception. Please visit this https URL to try our code.
https://arxiv.org/abs/2403.14287
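Scoring with both cues can be sketched as a weighted sum of two cosine similarities, one over a semantic descriptor and one over a composition descriptor; the alpha weight and the descriptor split are hypothetical.

```python
import numpy as np

def composition_aware_ranking(query, candidates, alpha=0.5):
    """Rank database images by content and composition (illustrative).

    `query` and each candidate are (semantic_vec, composition_vec) pairs,
    e.g. a CNN global descriptor plus a composition descriptor; alpha
    balances the two cues and is a hypothetical hyperparameter.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    q_sem, q_comp = query
    scores = [alpha * cos(q_sem, c_sem) + (1 - alpha) * cos(q_comp, c_comp)
              for c_sem, c_comp in candidates]
    return np.argsort(scores)[::-1]                # best match first
```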
Deep hashing techniques have emerged as the predominant approach for efficient image retrieval. Traditionally, these methods utilize pre-trained convolutional neural networks (CNNs) such as AlexNet and VGG-16 as feature extractors. However, the increasing complexity of datasets poses challenges for these backbone architectures in capturing meaningful features essential for effective image retrieval. In this study, we explore the efficacy of employing high-resolution features learned through state-of-the-art techniques for image retrieval tasks. Specifically, we propose a novel methodology that utilizes High-Resolution Networks (HRNets) as the backbone for the deep hashing task, termed High-Resolution Hashing Network (HHNet). Our approach demonstrates superior performance compared to existing methods across all tested benchmark datasets, including CIFAR-10, NUS-WIDE, MS COCO, and ImageNet. This performance improvement is more pronounced for complex datasets, which highlights the need to learn high-resolution features for intricate image retrieval tasks. Furthermore, we conduct a comprehensive analysis of different HRNet configurations and provide insights into the optimal architecture for the deep hashing task.
https://arxiv.org/abs/2403.13747
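The hashing head that would sit on top of an HRNet backbone reduces to a linear projection with a tanh relaxation during training and sign binarization at index time; the layer shapes below are hypothetical, and this is a generic deep-hashing head rather than the HHNet code.

```python
import numpy as np

def hashing_head(features, W, b):
    """Map pooled backbone features to binary hash codes (illustrative).

    tanh gives relaxed codes in (-1, 1) as a differentiable surrogate
    during training; sign() yields the final {-1, +1} codes for retrieval.
    """
    relaxed = np.tanh(features @ W + b)
    return np.sign(relaxed)

rng = np.random.default_rng(0)
codes = hashing_head(rng.normal(size=(4, 256)),    # 4 pooled feature vectors
                     rng.normal(size=(256, 64)),   # learned layer -> 64 bits
                     np.zeros(64))
```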
Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject where each subject requires dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub.
https://arxiv.org/abs/2403.11207
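The per-subject alignment can be sketched as a ridge regression from a new subject's voxels to the shared-subject latent space, leaving the shared nonlinear map to CLIP space and the fine-tuned Stable Diffusion XL decoder untouched. The regularizer and the closed-form solve are illustrative assumptions.

```python
import numpy as np

def fit_subject_alignment(voxels, shared_latents, lam=1e3):
    """Fit a subject-specific linear map into the shared latent space.

    Illustrative ridge regression: only this map needs fitting on the
    minimal data from a new subject; everything downstream stays frozen.
    Apply as: latents = voxels @ W.
    """
    X, Y = voxels, shared_latents
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return W
```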
The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross-retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to a shared audio-text representation have been investigated, suggesting the importance of how audio-image co-occurrence is learned. Conventional approaches in audio-image learning assign a single image, randomly selected from the corresponding video stream, to the entire audio clip, assuming their co-occurrence. However, this method may not accurately capture the temporal agreement between the target audio and image, because a single image can only represent a snapshot of a scene while the target audio changes from moment to moment. To address this problem, we propose two methods for audio and image matching that effectively capture the temporal information: (i) Nearest Match, wherein an image is selected from multiple time frames based on similarity with the audio, and (ii) Multiframe Match, wherein audio and image pairs of multiple time frames are used. Experimental results show that method (i) improves audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. Conversely, method (ii) improves the performance of audio-image retrieval while not showing significant improvements in audio-text retrieval performance. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer to audio-text retrieval.
https://arxiv.org/abs/2403.10756
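Both matching schemes reduce to a few lines once embeddings are available; the cosine scoring below is an assumption about how similarity with audio is computed.

```python
import numpy as np

def nearest_match(audio_emb, frame_embs):
    """Nearest Match (illustrative): pick the video frame whose embedding
    best agrees with the audio clip, instead of a randomly chosen frame."""
    a = audio_emb / np.linalg.norm(audio_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return int(np.argmax(f @ a))          # index of the best-aligned frame

def multiframe_match(audio_embs, frame_embs):
    """Multiframe Match (illustrative): keep every time-aligned
    (audio, frame) pair as a positive training pair."""
    return list(zip(audio_embs, frame_embs))
```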
In recent years, the dominant accuracy metric for vector search is the recall of a result list of fixed size (top-k retrieval), considering as ground truth the exact vector retrieval results. Although convenient to compute, this metric is distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision needs to be taken depending on the vector retrieval results, for example, deciding whether a query image matches a database image or not. We solve this as a range search task, where all vectors within a certain radius from the query are returned. We show that the value of a range search result can be modeled rigorously based on the query-to-vector distance. This yields a metric for range search, RSM, that is both principled and easy to compute without running an end-to-end evaluation. We apply this metric to the case of image retrieval. We show that indexing methods that are adapted for top-k retrieval do not necessarily maximize the RSM. In particular, for inverted file based indexes, we show that visiting a limited set of clusters and encoding vectors compactly yields near optimal results.
https://arxiv.org/abs/2403.10746
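In the spirit of the proposed metric, a simplified range-search score can weight each ground-truth vector inside the radius by a value that decays with its distance to the query, and report the fraction of that value captured by the returned set. The linear decay is an assumption, not the paper's RSM weighting.

```python
import numpy as np

def range_search_score(query, returned, ground_truth, radius):
    """Distance-weighted range-search metric (simplified illustration).

    Vectors closer to the query carry more value; the score is the value
    captured by the returned set over the total value within the radius.
    """
    def value(v):
        d = np.linalg.norm(query - v)
        return max(0.0, 1.0 - d / radius)          # linear decay (assumed)

    total = sum(value(v) for v in ground_truth)
    got = sum(value(v) for v in returned)
    return got / total if total > 0 else 1.0
```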
We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language-Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.
https://arxiv.org/abs/2403.08851
A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of the teachers' private data as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as the sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.
https://arxiv.org/abs/2403.10552
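The data flow reduces to a short sketch: the teacher's self-localization interface is the only accessible channel, and its answers to the student's probe images become the pseudo-training set. The function signature is hypothetical.

```python
def build_pseudo_dataset(teacher_localize, probe_images):
    """Reconstruct a pseudo-training set from a black-box teacher (sketch).

    `teacher_localize` maps an image to the teacher's predicted place; its
    answers become pseudo-labels for the student's continual learning, with
    no access to the teacher's internals or private data.
    """
    return [(img, teacher_localize(img)) for img in probe_images]
```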
This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. Importantly, we democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial conditioning. To rectify this, we propose an abstraction-aware framework, utilising a sketch adapter, adaptive time-step sampling, and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple, rough sketch akin to what you and I can create suffices! We welcome everyone to examine the results presented in the paper and its supplementary material. Contributions include democratising sketch control, introducing an abstraction-aware framework, and leveraging discriminative guidance, validated through extensive experiments.
https://arxiv.org/abs/2403.07234