Over the last decades, substantial progress has been made on Structure from Motion (SfM). However, the vast majority of existing approaches work in an offline manner, i.e., images are first captured and then fed together into an SfM pipeline to obtain camera poses and a sparse point cloud. In this work, by contrast, we present an on-the-fly SfM: SfM runs online while images are being captured, and each newly taken image is immediately estimated with its corresponding pose and points, i.e., what you capture is what you get. Specifically, our approach first employs a vocabulary tree, trained in an unsupervised manner on learning-based global features, for fast image retrieval of each newly fly-in image. Then, a robust feature matching mechanism based on least squares (LSM) is presented to improve image registration performance. Finally, by investigating the influence of the newly fly-in image's connected neighboring images, an efficient hierarchical weighted local bundle adjustment (BA) is used for optimization. Extensive experimental results demonstrate that on-the-fly SfM meets the goal of robustly registering images online as they are captured.
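To make the retrieval step concrete, below is a minimal sketch that ranks previously registered images against a newly captured one by cosine similarity of their global descriptors. It is a brute-force stand-in for the paper's vocabulary-tree lookup; the function name, descriptor dimension, and toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def retrieve_neighbors(new_desc, database_descs, top_k=5):
    """Return indices of the top_k registered images most similar to the new one.

    new_desc       : (D,) global descriptor of the newly captured image
    database_descs : (N, D) global descriptors of already-registered images
    """
    # Cosine similarity between the new image and every registered image.
    new_desc = new_desc / (np.linalg.norm(new_desc) + 1e-12)
    db = database_descs / (np.linalg.norm(database_descs, axis=1, keepdims=True) + 1e-12)
    sims = db @ new_desc
    return np.argsort(-sims)[:top_k]

# Toy usage: 100 registered images with 128-D learning-based global features.
rng = np.random.default_rng(0)
database = rng.normal(size=(100, 128))
query = rng.normal(size=128)
print(retrieve_neighbors(query, database))
```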
https://arxiv.org/abs/2309.11883
Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text describing how to shift from that starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure of the texts and the user's shifting intention. As a result, these methods typically take shortcuts that disregard the visual cues of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype, and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representation used to retrieve the desired target image. The experimental results show that the proposed SSN achieves significant improvements of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes new state-of-the-art performance. Codes will be publicly available.
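As a rough illustration of the two-step shift, the toy module below first forms a visual prototype from the reference embedding and the instruction embedding, then enriches that prototype into the query used for retrieval. The MLP structure, dimensions, and names are assumptions made for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class SemanticShiftToy(nn.Module):
    """Toy two-step shift: reference image -> visual prototype -> target query.

    The two MLPs stand in for the paper's degradation and upgradation
    components; their exact structure here is an illustrative assumption.
    """

    def __init__(self, dim=512):
        super().__init__()
        self.degrade = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.upgrade = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_emb, text_emb):
        # Step 1: strip reference-specific details to form a visual prototype.
        prototype = self.degrade(torch.cat([ref_emb, text_emb], dim=-1))
        # Step 2: enrich the prototype with the instruction to form the
        # representation used to retrieve the target image.
        return self.upgrade(torch.cat([prototype, text_emb], dim=-1))

model = SemanticShiftToy()
query = model(torch.randn(4, 512), torch.randn(4, 512))
print(query.shape)  # torch.Size([4, 512])
```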
https://arxiv.org/abs/2309.09531
The ability to retrieve a photo by mere free-hand sketching highlights the immense potential of fine-grained sketch-based image retrieval (FG-SBIR). However, its rapid practical adoption, as well as its scalability, is limited by the expense of acquiring faithful sketches for easily available photo counterparts. A solution to this problem is Active Learning, which could minimise the need for labeled sketches while maximising performance. Despite extensive studies in the field, no existing work utilises it to reduce the sketching effort in FG-SBIR tasks. To this end, we propose a novel active learning sampling technique that drastically minimises the need for drawing photo sketches. Our proposed approach tackles the trade-off between uncertainty and diversity by utilising the relationship between existing photo-sketch pairs and a photo that does not yet have a sketch, and by augmenting this relation with its intermediate representations. Since our approach relies only on the underlying data distribution, it is agnostic to the modelling approach and hence is applicable to other cross-modal instance-level retrieval tasks as well. Through experiments on two publicly available fine-grained SBIR datasets, ChairV2 and ShoeV2, we validate our approach and reveal its superiority over adapted baselines.
https://arxiv.org/abs/2309.08743
In image retrieval, standard evaluation metrics rely on score ranking, e.g., average precision (AP), recall at k (R@k), and normalized discounted cumulative gain (NDCG). In this work we introduce a general framework for robust and decomposable optimization of rank losses. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. First, we propose a general surrogate for the ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upper bound on rank losses and ensures robust training. Second, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally, we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally, we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large-scale Google Landmarks v2 dataset. The hierarchical dataset is publicly available at this https URL. Code will be released at this https URL.
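The core difficulty is that the rank of a score is a step function. A common way to make it amenable to SGD is to relax the pairwise comparisons with a sigmoid, as in the generic smooth-AP-style sketch below. This is a standard relaxation shown only for illustration; it is not the paper's SupRank surrogate (which is constructed as an upper bound), and the temperature and toy data are assumptions.

```python
import torch

def smooth_rank(scores, tau=0.01):
    """Differentiable rank of each item in `scores` (rank 1 = highest score)."""
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)      # diff[i, j] = s_i - s_j
    greater = torch.sigmoid(diff / tau)                   # soft indicator of s_i > s_j
    return 1.0 + greater.sum(dim=0) - greater.diagonal()  # remove the self-comparison

def smooth_ap_loss(scores, labels, tau=0.01):
    """1 - smooth average precision for a single query (labels: 1 = relevant)."""
    pos = labels.bool()
    rank_all = smooth_rank(scores, tau)                   # rank among all items
    rank_pos = smooth_rank(scores[pos], tau)              # rank among positives only
    ap = (rank_pos / rank_all[pos]).mean()
    return 1.0 - ap

scores = torch.randn(8, requires_grad=True)
labels = torch.tensor([1, 0, 1, 0, 0, 1, 0, 0])
loss = smooth_ap_loss(scores, labels)
loss.backward()                                           # gradients flow through the soft ranks
print(float(loss))
```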
https://arxiv.org/abs/2309.08250
Loop Closure Detection (LCD) is an essential task in robotics and computer vision, serving as a fundamental component for various applications across diverse domains, including object recognition, image retrieval, and video analysis. LCD consists of identifying whether a robot has returned to a previously visited location, referred to as a loop, and then estimating the corresponding roto-translation with respect to that location. Despite the numerous advantages of radar sensors, such as their ability to operate under diverse weather conditions and to provide a wider field of view than other commonly used sensors (e.g., cameras or LiDARs), integrating radar data remains an arduous task due to intrinsic noise and distortion. To address this challenge, this research introduces RadarLCD, a novel supervised deep learning pipeline specifically designed for Loop Closure Detection using FMCW (Frequency Modulated Continuous Wave) radar sensors. RadarLCD, a learning-based LCD methodology explicitly designed for radar systems, makes a significant contribution by leveraging the pre-trained HERO (Hybrid Estimation Radar Odometry) model: although HERO was originally developed for radar odometry, its features are used to select key points crucial for LCD tasks. The methodology is evaluated on a variety of FMCW radar dataset scenes and compared to state-of-the-art systems such as Scan Context for place recognition and ICP for loop closure. The results demonstrate that RadarLCD surpasses the alternatives in multiple aspects of Loop Closure Detection.
https://arxiv.org/abs/2309.07094
This paper proposes an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images. Conventional deep metric learning methods focus on learning a discriminative embedding to describe the semantic features of images, which ignore the existence of uncertainty in each image resulting from noise or semantic ambiguity. Training without awareness of these uncertainties causes the model to overfit the annotated labels during training and produce unsatisfactory judgments during inference. Motivated by this, we argue that a good similarity model should consider the semantic discrepancies with awareness of the uncertainty to better deal with ambiguous images for more robust training. To achieve this, we propose to represent an image using not only a semantic embedding but also an accompanying uncertainty embedding, which describes the semantic characteristics and ambiguity of an image, respectively. We further propose an introspective similarity metric to make similarity judgments between images considering both their semantic differences and ambiguities. The gradient analysis of the proposed metric shows that it enables the model to learn at an adaptive and slower pace to deal with the uncertainty during training. The proposed IDML framework improves the performance of deep metric learning through uncertainty modeling and attains state-of-the-art results on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets for image retrieval and clustering. We further provide an in-depth analysis of our framework to demonstrate the effectiveness and reliability of IDML. Code: this https URL.
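A small sketch of what an uncertainty-aware similarity could look like is given below: the uncertainty embedding is read as a per-dimension log-variance that discounts ambiguous dimensions of the semantic embedding. This is one plausible form written under that assumption for illustration, not the paper's exact introspective metric.

```python
import torch

def introspective_similarity(z1, u1, z2, u2, eps=1e-6):
    """Toy uncertainty-aware similarity between two images.

    z1, z2 : semantic embeddings; u1, u2 : uncertainty embeddings, interpreted
    here as per-dimension log-variances. Dimensions that either image is
    unsure about contribute less to the distance (an illustrative assumption).
    """
    var = u1.exp() + u2.exp() + eps          # combined per-dimension ambiguity
    sq_diff = (z1 - z2) ** 2 / var           # down-weight ambiguous dimensions
    return -sq_diff.mean(dim=-1)             # higher value = more similar

z1, z2 = torch.randn(4, 128), torch.randn(4, 128)
u1, u2 = torch.zeros(4, 128), torch.zeros(4, 128)        # log-variances
print(introspective_similarity(z1, u1, z2, u2).shape)    # torch.Size([4])
```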
https://arxiv.org/abs/2309.09982
This paper introduces the first two pixel retrieval benchmarks. Pixel retrieval is segmented instance retrieval: just as semantic segmentation extends classification to the pixel level, pixel retrieval extends image retrieval and offers information about which pixels are related to the query object. In addition to retrieving images for a given query, it helps users quickly identify the query object in true-positive images and exclude false-positive images by denoting the correlated pixels. Our user study results show that pixel-level annotation can significantly improve the user experience. Compared with semantic and instance segmentation, pixel retrieval requires fine-grained recognition capability for variable-granularity targets. To this end, we propose pixel retrieval benchmarks named PROxford and PRParis, which are based on the widely used image retrieval datasets ROxford and RParis. Three professional annotators labeled 5,942 images over two rounds of double-checking and refinement. Furthermore, we conduct extensive experiments and analysis on state-of-the-art methods in image search, image matching, detection, segmentation, and dense matching using our pixel retrieval benchmarks. The results show that the pixel retrieval task is challenging for these approaches and distinct from existing problems, suggesting that further research can advance content-based pixel retrieval and thus the user search experience. The datasets can be downloaded from this https URL.
https://arxiv.org/abs/2309.05438
Self-supervised learning (SSL) using mixed images has been studied to learn various image representations. Existing methods using mixed images learn a representation by maximizing the similarity between the representation of the mixed image and the synthesized representation of the original images. However, few methods consider the synthesis of representations from the perspective of mathematical logic. In this study, we focused on the synthesis method of representations. We proposed a new SSL approach with mixed images and a new representation format based on many-valued logic. This format can indicate the feature-possession degree, that is, how much of each image feature is possessed by a representation. This representation format, together with representation synthesis by logic operations, ensures that the synthesized representation preserves the remarkable characteristics of the original representations. Our method performed competitively with previous representation synthesis methods on image classification tasks. We also examined the relationship between the feature-possession degree and the number of classes of images in a multilabel image classification dataset to verify that the intended learning was achieved. In addition, we discussed image retrieval as an application of our proposed representation format using many-valued logic.
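To picture how a representation of a mixed image can be synthesized with a logic operation over feature-possession degrees, the sketch below uses the max t-conorm as a many-valued OR. The choice of operation and the toy feature vectors are illustrative assumptions, not necessarily the operation used in the paper.

```python
import torch

def fuzzy_or(r1, r2):
    """Synthesize the target representation of a mixed image from two originals.

    r1 and r2 hold feature-possession degrees in [0, 1]; the max t-conorm is
    one standard many-valued-logic OR, used here purely for illustration.
    """
    return torch.maximum(r1, r2)

# Degrees of possession for six hypothetical features in each original image.
rep_a = torch.tensor([0.9, 0.1, 0.7, 0.0, 0.3, 0.2])
rep_b = torch.tensor([0.2, 0.8, 0.1, 0.6, 0.3, 0.0])
print(fuzzy_or(rep_a, rep_b))  # tensor([0.9000, 0.8000, 0.7000, 0.6000, 0.3000, 0.2000])
```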
https://arxiv.org/abs/2309.04148
Composed image retrieval, a task involving the search for a target image using a reference image and a complementary text as the query, has witnessed significant advancements owing to the progress made in cross-modal modeling. Unlike the general image-text retrieval problem with only one alignment relation, i.e., image-text, we argue for the existence of two types of relations in composed image retrieval. The explicit relation pertains to reference image & complementary text-target image, which is commonly exploited by existing methods. Besides this intuitive relation, our observations in practice have uncovered another implicit yet crucial relation, i.e., reference image & target image-complementary text, since we found that the complementary text can be inferred by studying the relation between the target image and the reference image. Regrettably, existing methods largely focus on leveraging the explicit relation to learn their networks while overlooking the implicit relation. In response to this weakness, we propose a new framework for composed image retrieval, termed dual relation alignment, which integrates both explicit and implicit relations to fully exploit the correlations among the triplets. Specifically, we design a vision compositor that first fuses the reference image and the target image; the resulting representation then serves two roles: (1) a counterpart for semantic alignment with the complementary text and (2) compensation for the complementary text to boost the explicit relation modeling, thereby implanting the implicit relation into the alignment learning. Our method is evaluated on two popular datasets, CIRR and FashionIQ, through extensive experiments. The results confirm the effectiveness of our dual-relation learning in substantially enhancing composed image retrieval performance.
https://arxiv.org/abs/2309.02169
Today, the exponential rise of large models developed by academic and industrial institutions with the help of massive computing resources raises the question of whether someone without access to such resources can make a valuable scientific contribution. To explore this, we tried to solve the challenging task of multilingual image retrieval with a limited budget of $1,000. As a result, we present NLLB-CLIP, a CLIP model with a text encoder from the NLLB model. To train the model, we used an automatically created dataset of 106,246 good-quality images with captions in 201 languages derived from the LAION COCO dataset. We trained multiple models using image and text encoders of various sizes and kept different parts of the model frozen during training. We thoroughly analyzed the trained models using existing evaluation datasets and the newly created XTD200 and Flickr30k-200 datasets. We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
https://arxiv.org/abs/2309.01859
The sheer number of sources that will be detected by next-generation radio surveys will be astronomical, which will result in serendipitous discoveries. Data-dependent deep hashing algorithms have been shown to be efficient at image retrieval tasks in the fields of computer vision and multimedia. However, there are limited applications of these methodologies in the field of astronomy. In this work, we utilize deep hashing to rapidly search for similar images in a large database. The experiment uses a balanced dataset of 2,708 samples consisting of four classes: Compact, FRI, FRII, and Bent. The performance of the method was evaluated using the mean average precision (mAP) metric, where a precision of 88.5% was achieved. The experimental results demonstrate the capability to search and retrieve similar radio images efficiently and at scale. The retrieval is based on the Hamming distance between the binary hash of the query image and those of the reference images in the database.
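Once binary hashes exist, the retrieval step is a ranking of the database by Hamming distance to the query hash, as in the sketch below. The 64-bit code length, random toy data, and function names are illustrative assumptions; a real system would obtain the bits by binarizing the output of the trained hashing network.

```python
import numpy as np

def hamming_retrieve(query_hash, db_hashes, top_k=5):
    """Rank database images by Hamming distance to the query's binary hash.

    query_hash : (B,) array of 0/1 bits for the query image
    db_hashes  : (N, B) array of 0/1 bits for the reference images
    """
    distances = np.count_nonzero(db_hashes != query_hash, axis=1)
    order = np.argsort(distances)[:top_k]
    return order, distances[order]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 64))
query = db[42].copy()
query[:3] ^= 1                      # flip a few bits to simulate a near-duplicate
idx, dist = hamming_retrieve(query, db)
print(idx, dist)                    # index 42 should rank first with distance 3
```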
https://arxiv.org/abs/2309.00932
We consider the problem of composed image retrieval, which takes an input query consisting of an image and a modification text indicating the desired changes to be made to the image, and retrieves images that match these changes. Current state-of-the-art techniques that address this problem use global features for the retrieval, resulting in incorrect localization of the regions of interest to be modified because of the global nature of the features, all the more so for real-world, in-the-wild images. Since modifier texts usually correspond to specific local changes in an image, it is critical that models learn local features to be able to both localize and retrieve better. To this end, our key novelty is a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step. We achieve this by first proposing a new visual attention computation technique, which we call multi-modal gradient attention (MMGrad), that is explicitly conditioned on the modifier text. We then demonstrate how MMGrad can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces these MMGrad attention maps to highlight the correct local regions corresponding to the modifier text. By training retrieval models with this new loss function, we show improved grounding by means of better visual attention maps, leading to better explainability of the models as well as competitive quantitative retrieval performance on standard benchmark datasets.
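A rough, Grad-CAM-style analogue of text-conditioned gradient attention is sketched below: a similarity score between pooled visual features and the modifier-text embedding is back-propagated to the feature map, and the resulting gradients weight the spatial response. The projection layer, pooling choice, and shapes are assumptions made for illustration; this is not the paper's exact MMGrad formulation.

```python
import torch
import torch.nn.functional as F

def gradient_attention(feature_map, text_embedding, proj):
    """Grad-CAM-style spatial attention conditioned on the modifier text.

    feature_map    : (C, H, W) visual features with requires_grad=True
    text_embedding : (D,) embedding of the modifier text
    proj           : nn.Linear mapping pooled visual features (C) to D
    """
    pooled = feature_map.mean(dim=(1, 2))                      # (C,)
    score = F.cosine_similarity(proj(pooled), text_embedding, dim=0)
    grads, = torch.autograd.grad(score, feature_map)           # (C, H, W)
    weights = grads.mean(dim=(1, 2))                           # per-channel importance
    attn = F.relu((weights[:, None, None] * feature_map).sum(dim=0))
    return attn / (attn.max() + 1e-8)                          # (H, W) map in [0, 1]

feat = torch.randn(256, 7, 7, requires_grad=True)
text = torch.randn(512)
proj = torch.nn.Linear(256, 512)
print(gradient_attention(feat, text, proj).shape)              # torch.Size([7, 7])
```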
https://arxiv.org/abs/2308.16649
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at this https URL.
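The mining idea can be pictured as a nearest-neighbour search over caption embeddings, as in the sketch below; the embedding source, similarity threshold, and pairing rule are illustrative assumptions, and in the paper a large language model then writes the modification text for each mined pair.

```python
import numpy as np

def mine_caption_pairs(caption_embs, sim_threshold=0.8, top_k=1):
    """Mine video pairs whose captions are semantically close.

    caption_embs : (N, D) L2-normalized caption embeddings from any
                   off-the-shelf sentence encoder (an assumption here).
    Returns (i, j, similarity) candidates to be turned into triplets.
    """
    sims = caption_embs @ caption_embs.T
    np.fill_diagonal(sims, -1.0)                   # ignore self-matches
    pairs = []
    for i in range(sims.shape[0]):
        for j in np.argsort(-sims[i])[:top_k]:
            if sims[i, j] >= sim_threshold:
                pairs.append((i, int(j), float(sims[i, j])))
    return pairs

rng = np.random.default_rng(0)
embs = rng.normal(size=(50, 384))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
# The low threshold is only so the random toy data yields a few pairs.
print(len(mine_caption_pairs(embs, sim_threshold=0.1)))
```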
https://arxiv.org/abs/2308.14746
Patent retrieval has attracted tremendous interest from researchers in the intellectual property and information retrieval communities over the past decades. However, most existing approaches rely on textual and metadata information of the patent, and content-based image retrieval of patents is rarely investigated. Based on the traits of patent drawing images, we present a simple and lightweight model for this task. Without bells and whistles, this approach significantly outperforms other counterparts on a large-scale benchmark and noticeably improves the state of the art by 33.5% in mean average precision (mAP). Further experiments reveal that this model can be carefully scaled up to achieve a surprisingly high mAP of 93.5%. Our method ranks first in the ECCV 2022 Patent Diagram Image Retrieval Challenge.
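Since the results are reported in mean average precision, a compact reference implementation of that metric over ranked retrieval lists is given below; the two toy queries are made up solely for the worked check.

```python
import numpy as np

def average_precision(relevant, ranked_ids):
    """AP for one query: `ranked_ids` is the retrieval order, `relevant` a set of ids."""
    hits, precisions = 0, []
    for rank, idx in enumerate(ranked_ids, start=1):
        if idx in relevant:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(per_query):
    """per_query: list of (relevant_set, ranked_ids) tuples, one per query."""
    return float(np.mean([average_precision(r, ranked) for r, ranked in per_query]))

queries = [
    ({1, 2}, [1, 2, 3, 4]),   # perfect ranking, AP = 1.0
    ({3}, [1, 3, 2, 4]),      # relevant item at rank 2, AP = 0.5
]
print(mean_average_precision(queries))  # 0.75
```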
https://arxiv.org/abs/2308.13749
Text-pedestrian image retrieval aims to use text describing a pedestrian's appearance to retrieve the corresponding pedestrian image. This task involves not only modality discrepancy but also the challenge of textual diversity for pedestrians with the same identity. Although progress has been made in text-pedestrian image retrieval, existing methods do not comprehensively consider these problems. To address them, this paper proposes a progressive feature mining and external knowledge-assisted feature purification method. Specifically, we use a progressive mining mode to enable the model to mine discriminative features from neglected information, thereby avoiding the loss of discriminative information and improving the expressive ability of the features. In addition, to further reduce the negative impact of modality discrepancy and text diversity on cross-modal matching, we propose to use other sample knowledge of the same modality, i.e., external knowledge, to enhance identity-consistent features and weaken identity-inconsistent features. This process purifies the features and alleviates the interference caused by textual diversity and by negative-sample correlation features of the same modality. Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed method, whose retrieval performance even surpasses that of large-scale model-based methods on large-scale datasets.
https://arxiv.org/abs/2308.11994
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference one while integrating the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision-and-language pre-trained (VLP) models on various tasks, we rely on features from the OpenAI CLIP model to tackle this task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features, integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at this https URL.
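To illustrate the second stage, the sketch below mixes a learned fusion of the two CLIP features with their element-wise sum and normalizes the result for cosine-similarity retrieval. The gating design, layer sizes, and names are simplifying assumptions, not the paper's exact Combiner architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinerToy(nn.Module):
    """Simplified combiner of CLIP image and text features for retrieval."""

    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        cat = torch.cat([img_feat, txt_feat], dim=-1)
        w = self.gate(cat)                                    # how much to trust the fusion
        combined = w * self.fuse(cat) + (1 - w) * (img_feat + txt_feat)
        return F.normalize(combined, dim=-1)                  # ready for cosine retrieval

combiner = CombinerToy()
query = combiner(torch.randn(4, 512), torch.randn(4, 512))    # fused query features
targets = F.normalize(torch.randn(100, 512), dim=-1)          # candidate image features
print((query @ targets.T).argmax(dim=-1))                     # indices of best matches
```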
https://arxiv.org/abs/2308.11485
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of a pre-trained model's ability to learn from continuously growing data. Existing works on cross-modal pre-training mainly focus on training a network with a fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address these issues, in this paper we propose GrowCLIP, a data-driven automatic model-growing algorithm for contrastive language-image pre-training with continuously arriving image-text pairs as input. Specifically, we adopt a dynamic growth space and seek the optimal architecture at each growth step to adapt to online learning scenarios. A shared encoder is also proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architectures. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the local-minimum dilemma. Compared with existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks. For zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by 1.2% on the Flickr30K dataset.
https://arxiv.org/abs/2308.11331
Visual Place Recognition is a task that aims to predict the place of an image (called the query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To overcome this limitation, we propose a new method, called EigenPlaces, that trains the neural network on images from different points of view, which embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. The selection of these points of interest is done without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in the literature, finding that EigenPlaces outperforms the previous state of the art on the majority of datasets, while requiring 60% less GPU memory for training and using 50% smaller descriptors. The code and trained models for EigenPlaces are available at this https URL, while results with any other baseline can be computed with the codebase at this https URL.
https://arxiv.org/abs/2308.10832
Multi-turn textual feedback-based fashion image retrieval focuses on a real-world setting, where users can iteratively provide information to refine retrieval results until they find an item that fits all their requirements. In this work, we present a novel memory-based method, called FashionNTM, for such a multi-turn system. Our framework incorporates a new Cascaded Memory Neural Turing Machine (CM-NTM) approach for implicit state management, thereby learning to integrate information across all past turns to retrieve new images, for a given turn. Unlike vanilla Neural Turing Machine (NTM), our CM-NTM operates on multiple inputs, which interact with their respective memories via individual read and write heads, to learn complex relationships. Extensive evaluation results show that our proposed method outperforms the previous state-of-the-art algorithm by 50.5%, on Multi-turn FashionIQ -- the only existing multi-turn fashion dataset currently, in addition to having a relative improvement of 12.6% on Multi-turn Shoes -- an extension of the single-turn Shoes dataset that we created in this work. Further analysis of the model in a real-world interactive setting demonstrates two important capabilities of our model -- memory retention across turns, and agnosticity to turn order for non-contradictory feedback. Finally, user study results show that images retrieved by FashionNTM were favored by 83.1% over other multi-turn models. Project page: this https URL
https://arxiv.org/abs/2308.10170
Logo embedding plays a crucial role in various e-commerce applications by facilitating image retrieval or recognition, such as intellectual property protection and product search. However, current methods treat logo embedding as a purely visual problem, which may limit their performance in real-world scenarios. A notable issue is that the textual knowledge embedded in logo images has not been adequately explored. Therefore, we propose a novel approach that leverages textual knowledge as an auxiliary signal to improve the robustness of logo embedding. The emerging Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in both visual and textual understanding and could become valuable visual assistants for understanding logo images. Inspired by this observation, our proposed method, FashionLOGO, aims to utilize MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo embedding by prompting them to generate explicit textual knowledge through three types of prompts, namely image OCR, brief caption, and detailed description prompts, in a zero-shot setting. We adopt a cross-attention transformer that enables image embedding queries to automatically learn supplementary knowledge from the textual embeddings. To reduce computational costs, we only use the image embedding model in the inference stage, similar to traditional inference pipelines. Our extensive experiments on three real-world datasets demonstrate that FashionLOGO learns generalized and robust logo embeddings, achieving state-of-the-art performance on all benchmark datasets. Furthermore, we conduct comprehensive ablation studies to demonstrate the performance improvements resulting from the introduction of MLLMs.
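The cross-attention step can be pictured as the visual logo embedding querying the prompt-derived text embeddings, as in the minimal sketch below; the layer sizes, residual design, and module names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TextAugmentedLogoEmbedder(nn.Module):
    """Image embedding attends to textual embeddings via cross-attention.

    The visual logo embedding is the query; the OCR, caption, and description
    embeddings are the keys and values (a simplified illustration).
    """

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_emb, text_embs):
        # img_emb: (B, dim); text_embs: (B, T, dim) for T textual prompts.
        q = img_emb.unsqueeze(1)                               # (B, 1, dim)
        attended, _ = self.cross_attn(q, text_embs, text_embs)
        return self.norm(img_emb + attended.squeeze(1))        # enriched logo embedding

model = TextAugmentedLogoEmbedder()
out = model(torch.randn(2, 512), torch.randn(2, 3, 512))
print(out.shape)  # torch.Size([2, 512])
```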
https://arxiv.org/abs/2308.09012