In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, and for the first time tailor them to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that simply by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that surpasses all prior art by a large margin (24.8%), a strong testament to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains, in the region of 26.9% over the previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
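As an editorial illustration of the patch shuffling idea described above, the minimal sketch below applies one shared random permutation to the patch grids of a sketch and its paired photo, so that matching them forces a model to rely on instance-level structural correspondence. The grid size, tensor shapes, and function names are assumptions for illustration, not the authors' implementation.

```python
import torch

def shuffle_patches(sketch, photo, grid=4):
    """sketch, photo: (B, C, H, W) tensors; H and W are assumed divisible by `grid`."""
    B, C, H, W = sketch.shape
    ph, pw = H // grid, W // grid

    def to_patches(x):
        # (B, C, grid, ph, grid, pw) -> (B, grid*grid, C, ph, pw)
        x = x.reshape(B, C, grid, ph, grid, pw)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, grid * grid, C, ph, pw)

    def to_image(p):
        p = p.reshape(B, grid, grid, C, ph, pw).permute(0, 3, 1, 4, 2, 5)
        return p.reshape(B, C, H, W)

    perm = torch.randperm(grid * grid)  # one permutation shared by both modalities
    return to_image(to_patches(sketch)[:, perm]), to_image(to_patches(photo)[:, perm])
```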
https://arxiv.org/abs/2303.13440
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR), which progressively facilitates the cross-modal attention unit with adaptive attention factors to capture more flexible correspondences, and (ii) a Recurrent Aggregation Regulator (RAR), which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Notably, RCR and RAR are plug-and-play: both can be incorporated into many frameworks based on cross-modal interaction for significant benefit, and their combination yields further improvements. Extensive experiments on the MSCOCO and Flickr30K datasets validate that they bring impressive and consistent R@1 gains across multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: this https URL.
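A minimal sketch of the recurrent aggregation idea (RAR) under assumed shapes: alignment vectors are aggregated with weights that are re-estimated over a few rounds against the running summary, progressively emphasising important alignments. The module, dimensions, and number of steps are hypothetical, not the released code.

```python
import torch
import torch.nn as nn

class RecurrentAggregation(nn.Module):
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.score = nn.Linear(2 * dim, 1)  # scores each alignment against the running summary

    def forward(self, alignments):          # alignments: (B, N, dim) region-word alignment vectors
        summary = alignments.mean(dim=1)    # initial aggregate
        for _ in range(self.steps):
            ctx = summary.unsqueeze(1).expand_as(alignments)
            w = torch.softmax(self.score(torch.cat([alignments, ctx], dim=-1)), dim=1)
            summary = (w * alignments).sum(dim=1)  # re-weighted aggregate feeds the next round
        return summary
```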
https://arxiv.org/abs/2303.13371
This short technical report demonstrates a simple technique that yields state-of-the-art results in medical image-text matching tasks. We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has a negative impact on downstream performance in the medical domain, where encoding longer textual contexts is often required. We therefore train and release ClipMD, which uses a simple sliding-window technique to encode textual captions. ClipMD was tested on two medical image-text datasets and compared with other image-text matching models. The results show that ClipMD outperforms the other models on both datasets by a large margin. We make our code and pretrained model publicly available.
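A hedged sketch of the sliding-window idea: a long caption is split into overlapping word windows, each window is encoded by CLIP's text encoder, and the window embeddings are averaged. The window/stride sizes and the pooling choice are illustrative; this is not the released ClipMD code.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/32", device="cpu")

def encode_long_caption(caption, window=40, stride=20):
    """Encode a caption longer than CLIP's 77-token limit via overlapping word windows."""
    words = caption.split()
    chunks = [" ".join(words[i:i + window])
              for i in range(0, max(len(words) - window, 0) + 1, stride)]
    tokens = clip.tokenize(chunks, truncate=True)   # each chunk fits the 77-token limit
    with torch.no_grad():
        feats = model.encode_text(tokens)           # (num_windows, embed_dim)
    return feats.mean(dim=0)                        # single caption embedding
```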
https://arxiv.org/abs/2303.13340
Learning dense visual representations without labels is an arduous task, and more so from scene-centric data. We tackle this challenging problem with a Cross-view consistency objective paired with an Online Clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm operates jointly on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets, and similarly for video object segmentation. Our code and pre-trained models are publicly available at this https URL.
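A toy sketch of the joint-clustering intuition: dense features from both augmented views are clustered together, so every cluster (a pseudo semantic segment) is defined consistently across views. The mini-batch K-means below is only a stand-in for the paper's online clustering mechanism, and the shapes are assumptions.

```python
import torch
from sklearn.cluster import MiniBatchKMeans

def joint_view_clusters(feat_v1, feat_v2, k=8):
    """feat_v1, feat_v2: (N1, D) and (N2, D) dense (e.g. per-patch) features of two views."""
    joint = torch.cat([feat_v1, feat_v2], dim=0).detach().cpu().numpy()
    labels = MiniBatchKMeans(n_clusters=k).fit_predict(joint)
    # per-view assignments share the same cluster ids, i.e. the same pseudo segments
    return labels[: feat_v1.shape[0]], labels[feat_v1.shape[0]:]
```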
https://arxiv.org/abs/2303.13245
We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between the aligned content, invariant to changes due to residual temporal misalignment or artifacts arising from non-editorial changes to the content. Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on the content encountered.
https://arxiv.org/abs/2303.13193
Existing Optimal Transport (OT) methods mainly derive the optimal transport plan/matching under the criterion of transport cost/distance minimization, which may cause incorrect matching in some cases. In many applications, annotating a few matched keypoints across domains is reasonable, or even effortless, in terms of annotation burden. It is therefore valuable to investigate how to leverage the annotated keypoints to guide correct matching in OT. In this paper, we propose a novel KeyPoint-Guided model by ReLation preservation (KPG-RL) that searches for the optimal matching (i.e., the transport plan) guided by the keypoints. To impose the keypoints in OT, we first propose a mask-based constraint on the transport plan that preserves the matching of keypoint pairs. Second, we propose to preserve the relation of each data point to the keypoints to guide the matching. The proposed KPG-RL model can be solved by Sinkhorn's algorithm and is applicable even when the distributions are supported in different spaces. We further utilize the relation preservation constraint in the Kantorovich problem and the Gromov-Wasserstein model to impose keypoint guidance on them. Meanwhile, the proposed KPG-RL model is extended to the partial OT setting. Moreover, we deduce the dual formulation of the KPG-RL model, which is solved using deep learning techniques. Based on the transport plan learned from the dual KPG-RL, we propose a novel manifold barycentric projection to transport source data to the target domain. As applications, we apply the proposed KPG-RL model to heterogeneous domain adaptation and image-to-image translation. Experiments verify the effectiveness of the proposed approach.
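A hedged sketch of imposing keypoint guidance in entropic OT: for each annotated keypoint pair (i, j), all transport from row i except to column j (and vice versa) is forbidden by inflating its cost, and Sinkhorn is run on the modified cost. This mimics the mask-based constraint in spirit only; it is not the authors' exact formulation, and the marginals and hyperparameters are illustrative.

```python
import numpy as np
import ot  # POT: pip install pot

def keypoint_guided_plan(C, keypoints, a=None, b=None, reg=0.05, big=1e6):
    """C: (n, m) cost matrix; keypoints: list of (i, j) annotated matched index pairs."""
    n, m = C.shape
    a = np.full(n, 1.0 / n) if a is None else a
    b = np.full(m, 1.0 / m) if b is None else b
    C = C.copy()
    for i, j in keypoints:
        # make every entry in row i and column j expensive except the (i, j) entry itself
        C[i, :] += big
        C[:, j] += big
        C[i, j] -= 2 * big
    # log-domain Sinkhorn avoids underflow from the inflated costs
    return ot.sinkhorn(a, b, C, reg, method="sinkhorn_log")
```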
https://arxiv.org/abs/2303.13102
Open-vocabulary detection (OVD) is an object detection task that aims to detect objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that arises when applying a VL model trained on whole images to region recognition tasks; and (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learn generalizable object localization through a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, outperforming the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding-box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
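Region prompting, loosely sketched under assumed shapes: learnable prompt parameters are added to RoI-pooled CLIP features to compensate for the whole-image-to-region distribution gap before the pooled features are scored against class text embeddings. The additive form, grid size, and names are assumptions, not the CORA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionPrompt(nn.Module):
    def __init__(self, dim, grid=7):
        super().__init__()
        # learnable prompt added to every RoI-aligned CLIP feature map
        self.prompt = nn.Parameter(torch.zeros(1, dim, grid, grid))

    def forward(self, roi_feats, text_embeds):
        """roi_feats: (R, dim, grid, grid) RoI-aligned CLIP features; text_embeds: (K, dim)."""
        feats = (roi_feats + self.prompt).mean(dim=(2, 3))   # prompt-corrected, pooled regions
        feats = F.normalize(feats, dim=-1)
        text = F.normalize(text_embeds, dim=-1)
        return feats @ text.t()                              # region-to-class similarity logits
```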
https://arxiv.org/abs/2303.13076
Two sound field reproduction methods, weighted pressure matching and weighted mode matching, are compared theoretically and experimentally. Weighted pressure matching and weighted mode matching generalize conventional pressure matching and mode matching, respectively; both are derived by introducing a weighting matrix into the corresponding matching problem. In weighted pressure matching, the weighting matrix is defined on the basis of the kernel interpolation of the sound field from the pressure at a discrete set of control points. In weighted mode matching, the weighting matrix is defined by a regional integration of spherical wavefunctions. It is shown theoretically that weighted pressure matching is a special case of weighted mode matching, via infinite-dimensional harmonic analysis for estimating expansion coefficients from pressure observations. The differences between the two methods are discussed through experiments.
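A compact numerical sketch of (weighted) pressure matching under assumed notation: the driving signals d minimise ||W^{1/2}(G d - p_des)||^2 + lambda ||d||^2, where G maps loudspeaker drives to pressures at the control points and W is the weighting matrix (e.g. derived from kernel interpolation). Setting W to the identity recovers conventional pressure matching. The regularisation and variable names are illustrative.

```python
import numpy as np

def weighted_pressure_matching(G, p_des, W=None, lam=1e-3):
    """G: (M, L) transfer matrix, p_des: (M,) desired pressures, W: (M, M) weighting matrix."""
    M, L = G.shape
    W = np.eye(M) if W is None else W
    # regularised weighted normal equations: (G^H W G + lam I) d = G^H W p_des
    A = G.conj().T @ W @ G + lam * np.eye(L)
    return np.linalg.solve(A, G.conj().T @ W @ p_des)
```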
https://arxiv.org/abs/2303.13027
Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep-learning-based local image features and learnable matchers. Still, learnable matchers often underperform when there exist only small regions of co-visibility between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to effectively make use of the low-dimensional 3D information. We experiment with two different 3D signals, normalized object coordinates and monocular depth estimates, and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. We additionally demonstrate that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs, with a more than 8% boost compared to the 2D-only approach.
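A hedged sketch of one common way to positionally encode a low-dimensional 3D signal (e.g. per-keypoint monocular depth or normalized object coordinates) before concatenating it with 2D descriptors for a learnable matcher: a sinusoidal (Fourier) lifting into a higher-dimensional vector. The frequency bands, dimensions, and concatenation scheme are illustrative assumptions, not the paper's exact encoding.

```python
import math
import torch

def fourier_encode(x, num_freqs=8):
    """x: (N, D) low-dimensional 3D signal -> (N, D * 2 * num_freqs) sinusoidal encoding."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype)   # geometric frequency bands
    angles = x.unsqueeze(-1) * freqs * math.pi               # (N, D, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

# illustrative usage with assumed per-keypoint inputs
desc2d = torch.randn(500, 256)   # 2D local descriptors
depth = torch.rand(500, 1)       # monocular depth in [0, 1]
matcher_input = torch.cat([desc2d, fourier_encode(depth)], dim=-1)   # (500, 256 + 16)
```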
https://arxiv.org/abs/2303.12779
Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual and textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module under a masked language modeling paradigm. It achieves cross-modal interaction by integrating visual cues into the textual tokens with a multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% in Rank-1 accuracy compared to prior methods.
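A sketch of Similarity Distribution Matching as described: the row-wise softmax over image-to-text cosine similarities is pulled, via KL divergence, towards the normalized ground-truth matching distribution of the batch. The temperature, epsilon, and label convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sdm_loss(img_emb, txt_emb, labels, tau=0.02, eps=1e-8):
    """img_emb, txt_emb: (B, D); labels: (B,) identity labels (same id => true match)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau                                  # (B, B) image-to-text similarities
    match = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    target = match / match.sum(dim=1, keepdim=True)            # normalized label matching distribution
    pred = F.softmax(sim, dim=1)
    # KL(pred || target), averaged over the batch
    return (pred * (torch.log(pred + eps) - torch.log(target + eps))).sum(dim=1).mean()
```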
https://arxiv.org/abs/2303.12501
Digital image inpainting is an interpolation problem: the content of the missing (unknown) region is inferred to agree with the known region data such that the interpolated result fulfills some prior knowledge. Low rank and nonlocal self-similarity are two important priors for image inpainting. Based on the nonlocal self-similarity assumption, an image is divided into overlapping square target patches (submatrices), and the patches similar to any target patch are reshaped as vectors and stacked into a patch matrix. Such a patch matrix usually enjoys a property of low rank or approximately low rank, and its missing entries are recovered by low-rank matrix approximation (LRMA) algorithms. Traditionally, the $n$ nearest-neighbor similar patches are searched within a local window centered at a target patch. However, for an image with missing lines, the generated patch matrix is prone to having entirely missing rows, so the downstream low-rank model fails to reconstruct it well. To address this problem, we propose a region-wise matching (RwM) algorithm that divides the neighborhood of a target patch into multiple subregions and then searches for the most similar patch within each subregion. A non-convex weighted low-rank decomposition (NC-WLRD) model for LRMA is also proposed to reconstruct all degraded patch matrices grouped by the proposed RwM algorithm. We solve the proposed NC-WLRD model by the alternating direction method of multipliers (ADMM) and analyze its convergence in detail. Numerous experiments on line inpainting (entire rows/columns missing) demonstrate the superiority of our method over other competitive inpainting algorithms. Unlike other low-rank-based matrix completion methods and inpainting algorithms, the proposed NC-WLRD model is also effective for removing random-valued impulse noise and structural noise (stripes).
https://arxiv.org/abs/2303.12421
As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. To achieve this, massive and correctly aligned data pairs are required for model training. However, unlike unimodal datasets, multimodal datasets are much harder to collect and annotate precisely. As an alternative, co-occurring data pairs (e.g., image-text pairs) collected from the Internet have been widely exploited in this area. Unfortunately, such cheaply collected datasets unavoidably contain many mismatched data pairs, which have been proven to be harmful to model performance. To address this, we propose a general framework called BiCro (Bidirectional Cross-modal similarity consistency), which can be easily integrated into existing cross-modal matching models to improve their robustness against noisy data. Specifically, BiCro aims to estimate soft labels for noisy data pairs that reflect their true degree of correspondence. The basic idea of BiCro is the observation that, taking image-text matching as an example, similar images should have similar textual descriptions and vice versa. The consistency of these two similarities can then be recast as estimated soft labels to train the matching model. Experiments on three popular cross-modal matching datasets demonstrate that our method significantly improves the noise robustness of various matching models and surpasses the state-of-the-art by a clear margin.
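A simplified, hedged sketch of the bidirectional-consistency intuition: for a candidate image-text pair, compare how similar the image is to a set of assumed-clean anchor images and how similar the text is to those anchors' texts; agreement between the two similarity profiles yields a soft correspondence label in [0, 1]. This is an illustration of the intuition only, not the exact BiCro estimator.

```python
import torch
import torch.nn.functional as F

def soft_labels(img_emb, txt_emb, anchor_img, anchor_txt):
    """img_emb, txt_emb: (B, D) embeddings of noisy pairs; anchor_*: (K, D) clean anchor pairs."""
    si = F.normalize(img_emb, dim=-1) @ F.normalize(anchor_img, dim=-1).t()   # (B, K) image-side profile
    st = F.normalize(txt_emb, dim=-1) @ F.normalize(anchor_txt, dim=-1).t()   # (B, K) text-side profile
    # agreement of the two profiles, clamped to [0, 1], serves as the soft label
    return F.cosine_similarity(si, st, dim=1).clamp(min=0.0)
```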
https://arxiv.org/abs/2303.12419
We propose a content-based system for matching video and background music. The system aims to address the challenges of music recommendation for new users or new music given short-form videos. To this end, we propose a cross-modal framework, VMCML, that finds a shared embedding space between video and music representations. To ensure the embedding space can be effectively shared by both representations, we leverage the CosFace loss, a margin-based cosine similarity loss. Furthermore, we establish a large-scale dataset called MSVD, which provides 390 individual music tracks and the corresponding 150,000 matched videos. We conduct extensive experiments on the Youtube-8M and MSVD datasets. Our quantitative and qualitative results demonstrate the effectiveness of the proposed framework, which achieves state-of-the-art video-music matching performance.
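A sketch of the margin-based cosine (CosFace-style) loss mentioned above, applied here to paired video and music embeddings treated as matching classes within a batch. The scale and margin values, and the in-batch pairing setup, are illustrative defaults rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def cosface_matching_loss(video_emb, music_emb, s=30.0, m=0.35):
    """video_emb, music_emb: (B, D); row i of each is a matched video-music pair."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(music_emb, dim=-1)
    cos = v @ a.t()                                          # (B, B) cosine similarities
    target = torch.arange(cos.size(0), device=cos.device)
    # subtract the margin only on the matched (diagonal) entries, then scale
    logits = s * (cos - m * F.one_hot(target, cos.size(0)))
    return F.cross_entropy(logits, target)
```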
https://arxiv.org/abs/2303.12379
Sequential video understanding, as an emerging video understanding task, has drawn increasing attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As the frame-sentence correspondence is not available, we propose to use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise the network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at this https URL.
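A minimal sketch of the pseudo-correspondence idea: because the actions are known to occur in script order, the T frames are split into contiguous segments, one per sentence, and the k-th sentence is assigned to the k-th segment as a pseudo label. The equal-length splitting is an illustrative simplification, not the paper's exact labelling scheme.

```python
import torch

def pseudo_frame_sentence_labels(num_frames, num_sentences):
    """Returns a (num_frames,) tensor mapping each frame index to a sentence index."""
    bounds = torch.linspace(0, num_sentences, steps=num_frames + 1)[:-1]
    return bounds.floor().long().clamp(max=num_sentences - 1)

# e.g. 120 frames, 4 ordered action sentences: frames 0-29 -> sentence 0, 30-59 -> 1, ...
labels = pseudo_frame_sentence_labels(num_frames=120, num_sentences=4)
```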
https://arxiv.org/abs/2303.12370
On-demand ride services, or ride-sourcing services, have experienced fast development over the past decade. Various mathematical models and optimization algorithms have been developed to help ride-sourcing platforms design operational strategies with higher efficiency. However, due to cost and reliability issues (implementing an immature algorithm in real operations may cause system turbulence), it is commonly infeasible to validate these models and train/test these optimization algorithms within real-world ride-sourcing platforms. Acting as a useful test bed, a simulation platform for ride-sourcing systems is therefore very important for conducting algorithm training/testing and model validation through trial and error. While previous studies have established a variety of simulators for their own tasks, the field lacks a fair and public platform for comparing the models and algorithms proposed by different researchers. In addition, existing simulators still face many challenges, ranging from their closeness to real ride-sourcing environments to the completeness of the tasks they can implement. To address these challenges, we propose a novel multi-functional and open-sourced simulation platform for ride-sourcing systems, which can simulate the behaviors and movements of various agents on a real transportation network. It provides several accessible portals for users to train and test various optimization algorithms, especially reinforcement learning algorithms, on a variety of tasks, including on-demand matching, idle vehicle repositioning, and dynamic pricing. In addition, it can be used to test how well theoretical models approximate the simulated outcomes. Evaluated in experiments based on real-world data, the simulator is demonstrated to be an efficient and effective test bed for various tasks related to on-demand ride service operations.
https://arxiv.org/abs/2303.12336
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving. Current approaches all follow the Siamese paradigm based on appearance matching. However, LiDAR point clouds are usually textureless and incomplete, which hinders effective appearance matching. Besides, previous methods greatly overlook the critical motion clues among targets. In this work, beyond 3D Siamese tracking, we introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective. Following this paradigm, we propose a matching-free two-stage tracker, M^2-Track. At the first stage, M^2-Track localizes the target within successive frames via motion transformation. It then refines the target box through motion-assisted shape completion at the second stage. Owing to its motion-centric nature, our method shows impressive generalizability with limited training labels and provides good differentiability for end-to-end cycle training. This inspires us to explore semi-supervised LiDAR SOT by incorporating a pseudo-label-based motion augmentation and a self-supervised loss term. Under the fully supervised setting, extensive experiments confirm that M^2-Track significantly outperforms the previous state of the art on three large-scale datasets while running at 57 FPS (~8%, ~17% and ~22% precision gains on KITTI, NuScenes, and the Waymo Open Dataset, respectively). Under the semi-supervised setting, our method performs on par with, or even surpasses, its fully supervised counterpart using fewer than half of the labels from KITTI. Further analysis verifies each component's effectiveness and shows the motion-centric paradigm's promising potential for auto-labeling and unsupervised domain adaptation.
https://arxiv.org/abs/2303.12535
We propose LEAPS, an end-to-end one-step person search approach with learnable proposals. Given a set of sparse, learnable proposals, LEAPS employs a dynamic person search head to directly perform person detection and the corresponding re-id feature generation without non-maximum suppression post-processing. The dynamic person search head comprises a detection head and a novel flexible re-id head. Our flexible re-id head first employs a dynamic region-of-interest (RoI) operation to extract discriminative RoI features of the proposals. It then generates re-id features using a plain and a hierarchical interaction re-id module. To better guide discriminative re-id feature learning, we introduce a diverse re-id sample matching strategy, instead of the bipartite matching used in the detection head. Comprehensive experiments reveal the benefit of the proposed LEAPS, which achieves favorable performance on two public person search benchmarks: CUHK-SYSU and PRW. When using the same ResNet50 backbone, LEAPS obtains a mAP score of 55.0%, outperforming the best reported results in the literature by 1.7%, while achieving around a two-fold speedup on the challenging PRW dataset. Our source code and models will be released.
https://arxiv.org/abs/2303.11859
Deep-learning-based methods have become a paradigm for cover song identification (CSI) in recent years, and the ByteCover systems have achieved state-of-the-art results on all the mainstream CSI datasets. However, with the rapid growth of short videos, many real-world applications require matching short music excerpts to full-length music tracks in the database, a setting that is still under-explored and awaits an industrial-level solution. In this paper, we upgrade the previous ByteCover systems to ByteCover3, which utilizes local features to further improve identification performance on short music queries. ByteCover3 is designed with a local alignment loss (LAL) module and a two-stage feature retrieval pipeline, allowing the system to perform CSI more precisely and efficiently. We evaluated ByteCover3 on multiple datasets with different benchmark settings, where it beat all the compared methods, including its previous versions.
https://arxiv.org/abs/2303.11692
Opinion summarization provides an important solution for summarizing the opinions expressed across a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets constructed from aspect-related review contents. Our first approach, Seed Words Based Leave-One-Out (SW-LOO), identifies aspect-related portions of reviews simply by exact-matching aspect seed words, and outperforms existing methods by 3.4 ROUGE-L points on SPACE and 0.5 ROUGE-1 points on OPOSUM+ for aspect-specific opinion summarization. Our second approach, Natural Language Inference Based Leave-One-Out (NLI-LOO), identifies aspect-related sentences using an NLI model in a more general setting without seed words, and outperforms existing approaches by 1.2 ROUGE-L points on SPACE for aspect-specific opinion summarization while remaining competitive on other metrics.
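A small sketch of the seed-word step described above: a review sentence is tagged with an aspect when it contains one of that aspect's seed words via exact word matching. The seed lists and aspect names below are invented examples, not the ones used in the paper.

```python
SEED_WORDS = {
    "cleanliness": {"clean", "dirty", "dust", "spotless"},
    "service": {"staff", "service", "friendly", "rude"},
}

def aspect_sentences(review_sentences):
    """Group the sentences of one review by the aspects whose seed words they contain."""
    tagged = {aspect: [] for aspect in SEED_WORDS}
    for sent in review_sentences:
        tokens = set(sent.lower().split())
        for aspect, seeds in SEED_WORDS.items():
            if tokens & seeds:   # exact word match against the aspect's seed list
                tagged[aspect].append(sent)
    return tagged
```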
https://arxiv.org/abs/2303.11660
We present a new table structure recognition (TSR) approach, called TSRFormer, to robustly recognize the structures of complex tables with geometric distortions from various table images. Unlike previous methods, we formulate table separation line prediction as a line regression problem instead of an image segmentation problem, and propose a new two-stage dynamic-queries-enhanced DETR-based separation line regression approach, named DQ-DETR, to predict separation lines from table images directly. Compared to vanilla DETR, we propose three improvements in DQ-DETR to make the two-stage DETR framework work efficiently and effectively for the separation line prediction task: 1) a new query design, named Dynamic Query, which decouples a single line query into separable point queries and intuitively improves localization accuracy for regression tasks; 2) a dynamic-queries-based progressive line regression approach that progressively regresses points on the line, further enhancing localization accuracy for distorted tables; 3) a prior-enhanced matching strategy to solve the slow convergence issue of DETR. After separation line prediction, a simple relation-network-based cell merging module is used to recover spanning cells. With these new techniques, our TSRFormer achieves state-of-the-art performance on several benchmark datasets, including SciTSR, PubTabNet, WTW and FinTabNet. Furthermore, we have validated the robustness and high localization accuracy of our approach on tables with complex structures, borderless cells, large blank spaces, empty or spanning cells, as well as distorted or even curved shapes, on a more challenging real-world in-house dataset.
https://arxiv.org/abs/2303.11615