Reconstructions of visual perception from brain activity have improved tremendously, but the practical utility of such methods has been limited. This is because such models are trained independently per subject, with each subject requiring dozens of hours of expensive fMRI training data to attain high-quality results. The present work showcases high-quality reconstructions using only 1 hour of fMRI training data. We pretrain our model across 7 subjects and then fine-tune on minimal data from a new subject. Our novel functional alignment procedure linearly maps all brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL to accept CLIP latents as inputs instead of text. This approach improves out-of-subject generalization with limited training data and also attains state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches. MindEye2 demonstrates how accurate reconstructions of perception are possible from a single visit to the MRI facility. All code is available on GitHub.
https://arxiv.org/abs/2403.11207
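As a rough illustration of the shared-subject alignment idea described above, the sketch below maps each subject's voxels through a subject-specific linear layer into a shared latent space, followed by a shared non-linear head into CLIP image space. All dimensions, layer choices, and subject names are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class SharedSubjectAligner(nn.Module):
    """Per-subject linear maps into a shared latent space, then a shared
    non-linear head into CLIP image-embedding space (dims are illustrative)."""
    def __init__(self, voxel_counts: dict, shared_dim: int = 4096, clip_dim: int = 1024):
        super().__init__()
        # one linear map per subject: num_voxels -> shared_dim
        self.subject_linears = nn.ModuleDict(
            {sid: nn.Linear(n_vox, shared_dim) for sid, n_vox in voxel_counts.items()}
        )
        # shared non-linear mapping into CLIP space
        self.shared_head = nn.Sequential(
            nn.LayerNorm(shared_dim),
            nn.Linear(shared_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, clip_dim),
        )

    def forward(self, voxels: torch.Tensor, subject_id: str) -> torch.Tensor:
        z = self.subject_linears[subject_id](voxels)  # subject-specific alignment
        return self.shared_head(z)                    # shared mapping to CLIP space

# Fine-tuning on a new subject: add a fresh linear map and reuse the shared head.
model = SharedSubjectAligner({"subj01": 15000, "subj02": 14000})
model.subject_linears["subj08"] = nn.Linear(13000, 4096)
for p in model.shared_head.parameters():
    p.requires_grad = False  # keep the pretrained shared mapping frozen
```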
The aim of this research is to refine audio-image temporal agreement as a means of improving knowledge transfer for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to shared audio-text representation have been investigated, suggesting the importance of how audio-image co-occurrence is learned. Conventional approaches in audio-image learning assign a single image randomly selected from the corresponding video stream to the entire audio clip, assuming their co-occurrence. However, this method may not accurately capture the temporal agreement between the target audio and image because a single image can only represent a snapshot of a scene, even though the target audio changes from moment to moment. To address this problem, we propose two methods for audio and image matching that effectively capture the temporal information: (i) Nearest Match wherein an image is selected from multiple time frames based on similarity with audio, and (ii) Multiframe Match wherein audio and image pairs of multiple time frames are used. Experimental results show that method (i) improves the audio-text retrieval performance by selecting the nearest image that aligns with the audio information and transferring the learned knowledge. Conversely, method (ii) improves the performance of audio-image retrieval while not showing significant improvements in audio-text retrieval performance. These results indicate that refining audio-image temporal agreement may contribute to better knowledge transfer to audio-text retrieval.
https://arxiv.org/abs/2403.10756
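A minimal sketch of the Nearest Match idea above: given one audio embedding per clip and several candidate frame embeddings from the corresponding video, pick the frame most similar to the audio as the positive pair. The tensor shapes and the use of cosine similarity are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nearest_match(audio_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
    """Pick, for each audio clip, the video frame whose embedding is most
    similar to the audio embedding (cosine similarity).
    audio_emb:  (B, D)    one embedding per audio clip
    frame_embs: (B, T, D) T candidate frames per clip
    returns:    (B, D)    the selected frame embedding per clip
    """
    a = F.normalize(audio_emb, dim=-1)        # (B, D)
    v = F.normalize(frame_embs, dim=-1)       # (B, T, D)
    sims = torch.einsum("bd,btd->bt", a, v)   # (B, T) cosine similarities
    idx = sims.argmax(dim=1)                  # nearest frame per clip
    return v[torch.arange(v.size(0)), idx]

# The selected audio-frame pairs would then feed a standard contrastive loss.
```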
In recent years, the dominant accuracy metric for vector search is the recall of a result list of fixed size (top-k retrieval), considering as ground truth the exact vector retrieval results. Although convenient to compute, this metric is distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision needs to be taken depending on the vector retrieval results, for example, deciding whether a query image matches a database image or not. We solve this as a range search task, where all vectors within a certain radius from the query are returned. We show that the value of a range search result can be modeled rigorously based on the query-to-vector distance. This yields a metric for range search, RSM, that is both principled and easy to compute without running an end-to-end evaluation. We apply this metric to the case of image retrieval. We show that indexing methods that are adapted for top-k retrieval do not necessarily maximize the RSM. In particular, for inverted file based indexes, we show that visiting a limited set of clusters and encoding vectors compactly yields near optimal results.
https://arxiv.org/abs/2403.10746
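The sketch below shows plain range search (all database vectors within a radius of the query) plus a simple set-level precision/recall check against exact results; the paper's RSM additionally weights each returned vector by its query-to-vector distance, which is not reproduced here.

```python
import numpy as np

def range_search(queries: np.ndarray, database: np.ndarray, radius: float):
    """Return, for each query, the indices of all database vectors within
    `radius` (L2 distance), i.e. the range-search result set."""
    results = []
    for q in queries:
        d = np.linalg.norm(database - q, axis=1)
        results.append(np.flatnonzero(d <= radius))
    return results

def set_precision_recall(retrieved, exact):
    """Compare an approximate range-search result set against the exact one."""
    tp = len(set(retrieved) & set(exact))
    prec = tp / max(len(retrieved), 1)
    rec = tp / max(len(exact), 1)
    return prec, rec
```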
We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language-Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.
https://arxiv.org/abs/2403.08851
A typical assumption in state-of-the-art self-localization models is that an annotated training dataset is available in the target workspace. However, this does not always hold when a robot travels in a general open-world. This study introduces a novel training scheme for open-world distributed robot systems. In our scheme, a robot ("student") can ask the other robots it meets at unfamiliar places ("teachers") for guidance. Specifically, a pseudo-training dataset is reconstructed from the teacher model and thereafter used for continual learning of the student model. Unlike typical knowledge transfer schemes, our scheme introduces only minimal assumptions on the teacher model, such that it can handle various types of open-set teachers, including uncooperative, untrainable (e.g., image retrieval engines), and blackbox teachers (i.e., data privacy). Rather than relying on the availability of private data of teachers as in existing methods, we propose to exploit an assumption that holds universally in self-localization tasks: "The teacher model is a self-localization system" and to reuse the self-localization system of a teacher as a sole accessible communication channel. We particularly focus on designing an excellent student/questioner whose interactions with teachers can yield effective question-and-answer sequences that can be used as pseudo-training datasets for the student self-localization model. When applied to a generic recursive knowledge distillation scenario, our approach exhibited stable and consistent performance improvement.
https://arxiv.org/abs/2403.10552
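A minimal sketch of the pseudo-training idea, under the assumption that the only accessible channel is the teacher's self-localization interface (`teacher_localize` is a hypothetical stand-in, not the paper's actual API): the student submits its own images as queries and keeps the teacher's predicted places as pseudo-labels.

```python
from typing import Callable, List, Tuple

def build_pseudo_dataset(
    student_images: List["Image"],
    teacher_localize: Callable[["Image"], int],  # image -> predicted place id
) -> List[Tuple["Image", int]]:
    """Ask the teacher to localize the student's own images and keep the
    (image, predicted place) pairs as pseudo-labels."""
    return [(img, teacher_localize(img)) for img in student_images]

# The resulting pairs can drive ordinary supervised/continual training of the
# student self-localization model, with no access to the teacher's private data.
```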
This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. We importantly democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial-conditioning. To rectify this, we propose an abstraction-aware framework, utilising a sketch adapter, adaptive time-step sampling, and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple, rough sketch akin to what you and I can create suffices! We welcome everyone to examine results presented in the paper and its supplementary. Contributions include democratising sketch control, introducing an abstraction-aware framework, and leveraging discriminative guidance, validated through extensive experiments.
https://arxiv.org/abs/2403.07234
Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composite image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.
https://arxiv.org/abs/2403.07222
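A toy sketch of sketch-text composition with pre-trained CLIP embeddings: the query is a convex combination of the sketch and text embeddings, matched against gallery photo embeddings by cosine similarity. The paper learns this composition; the fixed weighting below is only a stand-in.

```python
import torch
import torch.nn.functional as F

def compose_and_retrieve(sketch_emb, text_emb, gallery_embs, alpha=0.5, topk=5):
    """Combine a sketch embedding (D,) and a text embedding (D,) into one query
    and rank gallery photo embeddings (N, D) by cosine similarity."""
    q = F.normalize(alpha * F.normalize(sketch_emb, dim=-1)
                    + (1 - alpha) * F.normalize(text_emb, dim=-1), dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q                    # (N,) similarity to every gallery photo
    return sims.topk(topk).indices  # indices of the best-matching photos
```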
This paper, for the first time, explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos. This proficiency is underpinned by their robust cross-modal capabilities and shape bias, findings that are substantiated through our pilot studies. In order to harness pre-trained diffusion models effectively, we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. For the former, we identify which layers are most enriched with information and are best suited for the specific retrieval requirements (category-level or fine-grained). Then we employ visual and textual prompts to guide the model's feature extraction process, enabling it to generate more discriminative and contextually relevant cross-modal representations. Extensive experiments on several benchmark datasets validate significant performance improvements.
https://arxiv.org/abs/2403.07214
In this paper, we propose a novel abstraction-aware sketch-based image retrieval framework capable of handling sketch abstraction at varied levels. Whereas prior works have mainly focused on tackling sub-factors such as drawing style and order, we instead attempt to model abstraction as a whole, and propose feature-level and retrieval granularity-level designs so that the system builds into its DNA the necessary means to interpret abstraction. On learning abstraction-aware features, we for the first time harness the rich semantic embedding of a pre-trained StyleGAN model, together with a novel abstraction-level mapper that deciphers the level of abstraction and dynamically selects appropriate dimensions in the feature matrix correspondingly, to construct a feature matrix embedding that can be freely traversed to accommodate different levels of abstraction. For granularity-level abstraction understanding, we dictate that the retrieval model should not treat all abstraction levels equally and introduce a differentiable surrogate Acc.@q loss to inject that understanding into the system. Unlike the gold-standard triplet loss, our Acc.@q loss uniquely allows a sketch to narrow/broaden its focus in terms of how stringent the evaluation should be - the more abstract a sketch, the less stringent (higher $q$). Extensive experiments show that our method outperforms existing state-of-the-art methods in standard SBIR tasks along with challenging scenarios like early retrieval, forensic sketch-photo matching, and style-invariant retrieval.
https://arxiv.org/abs/2403.07203
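One plausible form of a differentiable Acc.@q surrogate is sketched below, under the assumption that the rank of the true photo is softened with a sigmoid and penalised only beyond the per-sketch budget q; the paper's exact formulation may differ.

```python
import torch

def soft_acc_at_q_loss(sim_pos, sim_neg, q, tau=0.05):
    """Differentiable stand-in for Acc.@q: estimate the soft rank of the true
    photo and penalise it only to the extent that it exceeds q.
    sim_pos: (B,)    similarity of each sketch to its true photo
    sim_neg: (B, N)  similarities to the other gallery photos
    q:       (B,)    per-sketch tolerance (larger for more abstract sketches)
    """
    # soft count of gallery photos ranked above the true match
    soft_rank = 1.0 + torch.sigmoid((sim_neg - sim_pos.unsqueeze(1)) / tau).sum(dim=1)
    # zero loss once the soft rank is within the allowed budget q
    return torch.relu(soft_rank - q).mean()
```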
Astronaut photography, spanning six decades of human spaceflight, presents a unique Earth observations dataset with immense value for both scientific research and disaster response. Despite its significance, accurately localizing the geographical extent of these images, crucial for effective utilization, poses substantial challenges. Current manual localization efforts are time-consuming, motivating the need for automated solutions. We propose a novel approach - leveraging image retrieval - to address this challenge efficiently. We introduce innovative training techniques, including Year-Wise Data Augmentation and a Neutral-Aware Multi-Similarity Loss, which contribute to the development of a high-performance model, EarthLoc. We develop six evaluation datasets and perform a comprehensive benchmark comparing EarthLoc to existing methods, showcasing its superior efficiency and accuracy. Our approach marks a significant advancement in automating the localization of astronaut photography, which will help bridge a critical gap in Earth observations data. Code and datasets are available at this https URL
https://arxiv.org/abs/2403.06758
Content-based image retrieval (CBIR) has the potential to significantly improve diagnostic aid and medical research in radiology. Current CBIR systems face limitations due to their specialization to certain pathologies, limiting their utility. In response, we propose using vision foundation models as powerful and versatile off-the-shelf feature extractors for content-based medical image retrieval. By benchmarking these models on a comprehensive dataset of 1.6 million 2D radiological images spanning four modalities and 161 pathologies, we identify weakly-supervised models as superior, achieving a P@1 of up to 0.594. This performance not only competes with a specialized model but does so without the need for fine-tuning. Our analysis further explores the challenges in retrieving pathological versus anatomical structures, indicating that accurate retrieval of pathological features presents greater difficulty. Despite these challenges, our research underscores the vast potential of foundation models for CBIR in radiology, proposing a shift towards versatile, general-purpose medical image retrieval systems that do not require specific tuning.
https://arxiv.org/abs/2403.06567
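For reference, a P@1 metric of the kind reported above can be computed as follows: every image queries all others using the off-the-shelf features, and the score is the fraction of queries whose single nearest neighbour shares the query's class. Cosine similarity is an assumption here.

```python
import numpy as np

def precision_at_1(features: np.ndarray, labels: np.ndarray) -> float:
    """P@1 with each image as a query against all others: the fraction of
    queries whose nearest neighbour (cosine similarity, self excluded)
    shares the query's label."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T
    np.fill_diagonal(sims, -np.inf)   # exclude the query itself
    nn_idx = sims.argmax(axis=1)
    return float((labels[nn_idx] == labels).mean())
```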
In this paper, we propose a new framework for improving Content Based Image Retrieval (CBIR) for texture images. This is achieved by using a new image representation based on the RCT-Plus transform, a novel variant of the Redundant Contourlet transform that extracts richer directional information from the image. Moreover, the image search process is improved through a learning-based approach in which the database images are classified using a similarity metric adapted to the statistical modeling of the RCT-Plus transform. A query is then first classified to select the best texture class, after which the retained class images are ranked to select the top ones. With this scheme, we achieve significant improvements in retrieval rates compared to previous CBIR schemes.
https://arxiv.org/abs/2403.06048
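A compact sketch of the two-stage search described above, with `classify` standing in for a classifier trained with the adapted similarity metric: the query is first assigned to a texture class, and only that class's images are ranked.

```python
import numpy as np

def two_stage_retrieval(query_feat, db_feats, db_classes, classify, topk=10):
    """Stage 1: assign the query to a texture class.
    Stage 2: rank only that class's database images by distance to the query."""
    cls = classify(query_feat)               # stage 1: best texture class
    idx = np.flatnonzero(db_classes == cls)  # candidate images of that class
    d = np.linalg.norm(db_feats[idx] - query_feat, axis=1)
    return idx[np.argsort(d)[:topk]]         # stage 2: top-ranked within the class
```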
We present a novel method for efficiently producing semi-dense matches across images. Previous detector-free matcher LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the transformer over the entire feature map is redundant due to shared local information, therefore we propose an aggregated attention mechanism with adaptive token selection for efficiency. Furthermore, we find spatial variance exists in LoFTR's fine correlation module, which is adverse to matching accuracy. A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences for accuracy improvement. Our efficiency optimized model is $\sim 2.5\times$ faster than LoFTR which can even surpass state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method can achieve higher accuracy compared with competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction. Project page: this https URL.
https://arxiv.org/abs/2403.04765
In the domain of image layout representation learning, the critical process of translating image layouts into succinct vector forms is increasingly significant across diverse applications, such as image retrieval, manipulation, and generation. Most approaches in this area heavily rely on costly labeled datasets and notably lack in adapting their modeling and learning methods to the specific nuances of photographic image layouts. This shortfall makes the learning process for photographic image layouts suboptimal. In our research, we directly address these challenges. We innovate by defining basic layout primitives that encapsulate various levels of layout information and by mapping these, along with their interconnections, onto a heterogeneous graph structure. This graph is meticulously engineered to capture the intricate layout information within the pixel domain explicitly. Advancing further, we introduce novel pretext tasks coupled with customized loss functions, strategically designed for effective self-supervised learning of these layout graphs. Building on this foundation, we develop an autoencoder-based network architecture skilled in compressing these heterogeneous layout graphs into precise, dimensionally-reduced layout representations. Additionally, we introduce the LODB dataset, which features a broader range of layout categories and richer semantics, serving as a comprehensive benchmark for evaluating the effectiveness of layout representation learning methods. Our extensive experimentation on this dataset demonstrates the superior performance of our approach in the realm of photographic image layout representation learning.
https://arxiv.org/abs/2403.03740
Image retrieval enables an efficient search through vast amounts of satellite imagery and returns similar images to a query. Deep learning models can identify images across various semantic concepts without the need for annotations. This work proposes to use Geospatial Foundation Models, like Prithvi, for remote sensing image retrieval with multiple benefits: i) the models encode multi-spectral satellite data and ii) generalize without further fine-tuning. We introduce two datasets to the retrieval task and observe a strong performance: Prithvi processes six bands and achieves a mean Average Precision of 97.62\% on BigEarthNet-43 and 44.51\% on ForestNet-12, outperforming other RGB-based models. Further, we evaluate three compression methods with binarized embeddings balancing retrieval speed and accuracy. They match the retrieval speed of much shorter hash codes while maintaining the same accuracy as floating-point embeddings but with a 32-fold compression. The code is available at this https URL.
https://arxiv.org/abs/2403.02059
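A generic sign-binarization sketch consistent with the 32-fold compression figure above (1 bit per float32 dimension); the paper evaluates several compression methods, and this is only one simple variant.

```python
import numpy as np

def binarize(embs: np.ndarray) -> np.ndarray:
    """Sign-binarise float32 embeddings to 1 bit per dimension and pack into
    bytes: 32x smaller than the original float32 vectors."""
    return np.packbits((embs > 0).astype(np.uint8), axis=1)

def hamming_topk(query_bits: np.ndarray, db_bits: np.ndarray, k: int = 10):
    """Rank the packed binary database by Hamming distance to a packed query."""
    xor = np.bitwise_xor(db_bits, query_bits)      # differing bits, byte-packed
    dist = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per database row
    return np.argsort(dist)[:k]
```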
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent. Existing methods have made great progress with advanced large vision-language (VL) models in the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deploying the large vision-language model in resource-restricted environments. To tackle the above problems, we propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and relies only on unlabeled images for composition learning. In the framework, we propose a new adaptive token learner that maps an image to a sentence in the word embedding space of the VL model. The sentence adaptively captures discriminative visual information and is further integrated with the text modifier. An asymmetric structure is devised for flexible deployment, in which a lightweight model is adopted for the query side while the large VL model is deployed on the gallery side. Global contrastive distillation and local alignment regularization are adopted to align the lightweight model with the VL model for the CIR task. Our experiments demonstrate that the proposed ISA can better cope with real retrieval scenarios and further improve retrieval accuracy and efficiency.
https://arxiv.org/abs/2403.01431
Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images sharing the same category across diverse domains without relying on labeled data. Prior approaches have typically decomposed the UCIR problem into two distinct tasks: intra-domain representation learning and cross-domain feature alignment. However, these segregated strategies overlook the potential synergies between these tasks. This paper introduces ProtoOT, a novel Optimal Transport formulation explicitly tailored for UCIR, which integrates intra-domain feature representation learning and cross-domain alignment into a unified framework. ProtoOT leverages the strengths of the K-means clustering method to effectively manage distribution imbalances inherent in UCIR. By utilizing K-means for generating initial prototypes and approximating class marginal distributions, we modify the constraints in Optimal Transport accordingly, significantly enhancing its performance in UCIR scenarios. Furthermore, we incorporate contrastive learning into the ProtoOT framework to further improve representation learning. This encourages local semantic consistency among features with similar semantics, while also explicitly enforcing separation between features and unmatched prototypes, thereby enhancing global discriminativeness. ProtoOT surpasses existing state-of-the-art methods by a notable margin across benchmark datasets. Notably, on DomainNet, ProtoOT achieves an average P@200 enhancement of 24.44%, and on Office-Home, it demonstrates a P@15 improvement of 12.12%. Code is available at this https URL.
https://arxiv.org/abs/2402.18411
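An illustrative sketch, not the paper's exact ProtoOT formulation: K-means provides prototypes and a cluster-size marginal, and entropic optimal transport (Sinkhorn) assigns features to prototypes under that marginal to cope with distribution imbalance.

```python
import numpy as np
from sklearn.cluster import KMeans

def prototype_transport(features, n_protos=10, reg=0.1, n_iter=100):
    """K-means prototypes + Sinkhorn assignment with cluster-size marginals."""
    km = KMeans(n_clusters=n_protos, n_init=10).fit(features)
    protos = km.cluster_centers_
    # target marginal estimated from cluster sizes (handles class imbalance)
    nu = np.bincount(km.labels_, minlength=n_protos).astype(float)
    nu /= nu.sum()
    mu = np.full(len(features), 1.0 / len(features))  # uniform over samples
    C = np.linalg.norm(features[:, None] - protos[None], axis=2) ** 2
    C /= C.max()                                      # normalise the cost matrix
    K = np.exp(-C / reg)
    u = np.ones(len(features))
    for _ in range(n_iter):                           # Sinkhorn iterations
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None]                   # transport plan (N, n_protos)
```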
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. While LSR has seen success in text retrieval, its application in multimodal retrieval remains underexplored. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors. We address issues of high dimension co-activation and semantic deviation through a new training algorithm, using Bernoulli random variables to control query expansion. Experiments with two dense models (BLIP, ALBEF) and two datasets (MSCOCO, Flickr30k) show that our proposed algorithm effectively reduces co-activation and semantic deviation. Our best-performing sparsified model outperforms state-of-the-art text-image LSR models with a shorter training time and lower GPU memory requirements. Our approach offers an effective solution for training LSR retrieval models in multimodal settings. Our code and model checkpoints are available at this http URL
https://arxiv.org/abs/2402.17535
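A rough sketch of turning a frozen dense embedding into a sparse lexical vector, with a Bernoulli mask as a simple stand-in for the paper's control of query expansion; the projection layer, vocabulary size, and keep probability are assumptions.

```python
import torch
import torch.nn as nn

class DenseToSparse(nn.Module):
    """Project a frozen dense embedding onto vocabulary logits and sparsify;
    during training a Bernoulli mask randomly drops expansion terms."""
    def __init__(self, dense_dim=256, vocab_size=30522, keep_prob=0.5):
        super().__init__()
        self.proj = nn.Linear(dense_dim, vocab_size)
        self.keep_prob = keep_prob

    def forward(self, dense_vec: torch.Tensor) -> torch.Tensor:
        weights = torch.log1p(torch.relu(self.proj(dense_vec)))  # sparse lexical weights
        if self.training:
            mask = torch.bernoulli(torch.full_like(weights, self.keep_prob))
            weights = weights * mask                             # control query expansion
        return weights
```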
Text-to-image retrieval plays a crucial role across various applications, including digital libraries, e-commerce platforms, and multimedia databases, by enabling the search for images using text queries. Despite the advancements in Multimodal Large Language Models (MLLMs), which offer leading-edge performance, their applicability in large-scale, varied, and ambiguous retrieval scenarios is constrained by significant computational demands and the generation of injective embeddings. This paper introduces the Text2Pic Swift framework, tailored for efficient and robust retrieval of images corresponding to extensive textual descriptions in sizable datasets. The framework employs a two-tier approach: the initial Entity-based Ranking (ER) stage addresses the ambiguity inherent in lengthy text queries through a multiple-queries-to-multiple-targets strategy, effectively narrowing down potential candidates for subsequent analysis. Following this, the Summary-based Re-ranking (SR) stage further refines these selections based on concise query summaries. Additionally, we present a novel Decoupling-BEiT-3 encoder, specifically designed to tackle the challenges of ambiguous queries and to facilitate both stages of the retrieval process, thereby significantly improving computational efficiency via vector-based similarity assessments. Our evaluation, conducted on the AToMiC dataset, demonstrates that Text2Pic Swift outperforms current MLLMs by achieving up to an 11.06% increase in Recall@1000, alongside reductions in training and retrieval durations by 68.75% and 99.79%, respectively.
https://arxiv.org/abs/2402.15276
Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.
https://arxiv.org/abs/2402.15120
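A hedged sketch of the fine-tuning recipe above: the image encoder stays frozen while the text encoder is trained so that captions and their generated paraphrases both align with the image. The encoders, loss, and hyperparameters are placeholders, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def paraphrase_finetune_step(image_encoder, text_encoder, images, caps, paras, opt, tau=0.07):
    """One illustrative training step: freeze the image tower, update the text
    tower so a caption and its paraphrase both match the paired image."""
    with torch.no_grad():
        img = F.normalize(image_encoder(images), dim=-1)  # frozen image encoder
    txt = F.normalize(text_encoder(caps), dim=-1)
    par = F.normalize(text_encoder(paras), dim=-1)
    logits_t = txt @ img.T / tau                          # caption -> image
    logits_p = par @ img.T / tau                          # paraphrase -> image
    target = torch.arange(img.size(0), device=img.device)
    loss = F.cross_entropy(logits_t, target) + F.cross_entropy(logits_p, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```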