The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective (e.g., CLIP), an image-to-text generative objective (e.g., PaLI), or a text-to-image generative objective (e.g., Parti). However, all three objectives can be pre-trained on the same data, image-text pairs, and they intuitively complement each other: contrasting provides global alignment capacity, while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios: for instance, 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot captioning.
https://arxiv.org/abs/2303.13455
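As a rough illustration of the unified training signal described above, the sketch below sums a batch-level contrastive loss with image-to-text and text-to-image token cross-entropies. It is a minimal sketch, not CoBIT's actual implementation; all tensor shapes, weights, and the assumption that images are represented as discrete tokens for generation are placeholders.

```python
import torch
import torch.nn.functional as F

def tri_objective_loss(img_emb, txt_emb, i2t_logits, txt_tokens,
                       t2i_logits, img_tokens, temperature=0.07,
                       w_con=1.0, w_i2t=1.0, w_t2i=1.0):
    """Hypothetical combination of the three pre-training objectives.

    img_emb, txt_emb: (B, D) pooled embeddings for contrastive alignment.
    i2t_logits:       (B, Lt, Vt) decoder logits for caption tokens.
    txt_tokens:       (B, Lt) ground-truth caption token ids.
    t2i_logits:       (B, Li, Vi) decoder logits for image tokens (e.g., VQ codes).
    img_tokens:       (B, Li) ground-truth image token ids.
    """
    # InfoNCE-style contrastive loss over the batch, in both directions.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Image-to-text and text-to-image generation as token-level cross-entropy.
    loss_i2t = F.cross_entropy(i2t_logits.flatten(0, 1), txt_tokens.flatten())
    loss_t2i = F.cross_entropy(t2i_logits.flatten(0, 1), img_tokens.flatten())

    return w_con * loss_con + w_i2t * loss_i2t + w_t2i * loss_t2i
```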
This short technical report demonstrates a simple technique that yields state-of-the-art results in medical image-text matching tasks. We analyze the use of OpenAI's CLIP, a general image-text matching model, and observe that CLIP's limited textual input size has a negative impact on downstream performance in the medical domain, where encoding longer textual contexts is often required. We thus train and release ClipMD, which uses a simple sliding-window technique to encode textual captions. ClipMD was tested on two medical image-text datasets and compared with other image-text matching models. The results show that ClipMD outperforms the other models on both datasets by a large margin. We make our code and pretrained model publicly available.
https://arxiv.org/abs/2303.13340
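The core idea above is to slide a fixed-size window over a caption that exceeds CLIP's token limit and pool the per-window features. The sketch below is a minimal version of that step under the assumption that windows are encoded independently and mean-pooled; the released ClipMD may aggregate differently, and `encode_text` stands in for whatever text encoder is used.

```python
import torch

def sliding_window_text_features(token_ids, encode_text, window=77, stride=64):
    """Encode a long caption with a fixed-size text encoder by sliding a window.

    token_ids:   1-D tensor of token ids for the full caption.
    encode_text: callable mapping a (1, window) id tensor to a (1, D) feature
                 (e.g., CLIP's text encoder); an assumed interface, not ClipMD's API.
    """
    starts = list(range(0, max(len(token_ids) - window, 0) + 1, stride))
    if starts[-1] + window < len(token_ids):          # make sure the tail is covered
        starts.append(len(token_ids) - window)
    feats = []
    for start in starts:
        chunk = token_ids[start:start + window]
        if len(chunk) < window:                        # pad the last short window
            pad = torch.zeros(window - len(chunk), dtype=token_ids.dtype)
            chunk = torch.cat([chunk, pad])
        feats.append(encode_text(chunk.unsqueeze(0)))
    return torch.stack(feats, dim=0).mean(dim=0)       # mean-pool window features
```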
In this paper, we investigate the open research task of generating controllable 3D textured shapes from given textual descriptions. Previous works either require ground-truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to align fake-rendered images with real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity, text-relevant 3D textured shapes.
https://arxiv.org/abs/2303.13273
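The pseudo-caption step described above retrieves CLIP-vocabulary words that match the rendered views and slots them into a template. A rough sketch of that retrieval-plus-template step, assuming precomputed CLIP embeddings for images and candidate words (the template string, pooling over views, and top-k choice are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def build_pseudo_caption(image_feats, word_feats, words,
                         template="a 3D model of {}", top_k=3):
    """image_feats: (V, D) CLIP features of V rendered views of one shape.
    word_feats:  (W, D) CLIP text features of W candidate vocabulary words.
    words:       list of W strings.
    Returns a template-based pseudo caption built from the top-k retrieved words.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    sim = (image_feats @ word_feats.t()).mean(dim=0)   # average similarity over views
    top = sim.topk(top_k).indices.tolist()
    return template.format(" ".join(words[i] for i in top))
```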
Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with large vocabularies covering many object classes, under the assumption that such data enable models to extract comprehensive knowledge about the relationships between various objects and to generalize better to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to state-of-the-art performance. PCL's simplicity and flexibility are further notable features: it is a straightforward pre-processing technique that can be used with any image captioning model, without imposing restrictions on model architecture or training process.
https://arxiv.org/abs/2303.13040
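Since PCL is framed as a pre-processing step, it can be pictured as: crop each object instance and let any off-the-shelf captioning model describe it, producing dense pseudo caption labels for distillation. The sketch below is a hypothetical version of that step; the `caption_model` interface and the context margin are assumptions, not the paper's exact procedure.

```python
def pseudo_caption_labels(image, boxes, caption_model, context=0.1):
    """image:         a PIL.Image.
    boxes:         list of (x1, y1, x2, y2) object boxes.
    caption_model: any callable mapping a cropped PIL image to a caption string.
    Returns one pseudo caption per box, cropped with a small context margin.
    """
    w, h = image.size
    captions = []
    for x1, y1, x2, y2 in boxes:
        mx, my = context * (x2 - x1), context * (y2 - y1)   # widen the crop slightly
        crop = image.crop((max(0, x1 - mx), max(0, y1 - my),
                           min(w, x2 + mx), min(h, y2 + my)))
        captions.append(caption_model(crop))
    return captions
```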
Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy of minimizing a single task-specific loss is typically adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning of the target task via auxiliary learning. We formulate auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet, and All-in-one) and show significant performance gains on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately `transforms' individual loss functions and `melts' them into an effective unified loss. Code is available at this https URL.
https://arxiv.org/abs/2303.13009
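To make the "non-linear loss combiner" idea concrete, the toy sketch below treats each scalar loss as a token and fuses them with a tiny transformer encoder into one unified loss. This only illustrates the forward combination; MELTR's actual architecture and the bi-level/AID update that trains the combiner are not shown, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class LossCombiner(nn.Module):
    """Illustrative non-linear combiner over K scalar loss values."""
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.Linear(1, dim)       # lift each scalar loss to a token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, 1)

    def forward(self, losses):
        # losses: (K,) tensor of individual task/auxiliary losses.
        tokens = self.embed(losses.unsqueeze(0).unsqueeze(-1))   # (1, K, dim)
        fused = self.encoder(tokens)                             # (1, K, dim)
        return self.head(fused).mean()                           # single unified loss

# Hypothetical usage: unified = combiner(torch.stack([l_task, l_aux1, l_aux2]))
```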
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve performance for real-world applications, mainly due to the challenge of long-tail words. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by an external stream and an internal stream. The external stream is designed to absorb additional knowledge: it models the interactions between that knowledge (e.g., a pre-built knowledge graph) and the built-in information of videos (e.g., salient object regions, speech transcripts, and video captions) to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of caption results. In addition, a cross-attention mechanism is used between the two streams for sharing information, so that the two streams can help each other produce more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results by improving the absolute CIDEr score by 18.7% on the YouCookII dataset.
https://arxiv.org/abs/2303.12423
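The information exchange between the two streams is described as cross attention. A minimal sketch of one such exchange step with `nn.MultiheadAttention`, where internal-stream tokens attend to the external (knowledge) stream and vice versa; dimensions and the residual update are illustrative, not TextKG's exact layer design.

```python
import torch
import torch.nn as nn

class TwoStreamCrossAttention(nn.Module):
    """One illustrative cross-attention exchange between two token streams."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.int_to_ext = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ext_to_int = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, internal, external):
        # internal: (B, Li, D) video/caption tokens; external: (B, Le, D) knowledge tokens.
        upd_int, _ = self.int_to_ext(internal, external, external)  # internal queries knowledge
        upd_ext, _ = self.ext_to_int(external, internal, internal)  # knowledge queries video
        return internal + upd_int, external + upd_ext
```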
Sequential video understanding, as an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is not available, we propose to use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondence and to supervise network training with the pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at this https URL
https://arxiv.org/abs/2303.12370
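The pseudo frame-sentence labels come from the temporal ordering of actions. The sketch below assumes a uniform sequential split of frames across sentences (a simplifying assumption; the paper's assignment rule may be more refined) and shows how such pseudo labels would feed a frame-sentence contrastive term.

```python
import torch
import torch.nn.functional as F

def sequential_pseudo_alignment(num_frames, num_sentences):
    """Assign each frame a sentence index, assuming actions occur in order
    and occupy roughly equal temporal spans (a simplifying assumption)."""
    return torch.div(torch.arange(num_frames) * num_sentences, num_frames,
                     rounding_mode="floor")

def frame_sentence_contrastive(frame_feats, sent_feats, labels, tau=0.07):
    """frame_feats: (N, D), sent_feats: (M, D), labels: (N,) pseudo sentence ids."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    sent_feats = F.normalize(sent_feats, dim=-1)
    logits = frame_feats @ sent_feats.t() / tau        # (N, M) similarities
    return F.cross_entropy(logits, labels)
```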
While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data with a single model, rather than generating one fixed modality conditioned on the other. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. In particular, we propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). Compared to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding through parallel token prediction with iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of MAGVLT on image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on step-unrolled mask prediction and selective prediction on a mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that MAGVLT outperforms ARGVLT by a large margin while also providing a significant inference speedup. In particular, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation on MS-COCO with a single moderately sized model (fewer than 500M parameters), even without the use of monomodal data and networks.
https://arxiv.org/abs/2303.12208
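Non-autoregressive mask prediction with iterative refinement is typically decoded by filling all masked positions in parallel and re-masking the least confident ones. Below is a generic mask-predict decoding loop as a sketch, assuming every position of `seq` is to be generated; the model interface and the cosine re-masking schedule are assumptions, not MAGVLT's exact procedure.

```python
import math
import torch

@torch.no_grad()
def mask_predict_decode(model, seq, mask_id, steps=8):
    """seq: (B, L) token ids, initialized to mask_id at every generation position.
    model(seq) is assumed to return logits of shape (B, L, V)."""
    B, L = seq.shape
    for t in range(steps):
        probs = model(seq).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                       # (B, L) confidence and argmax
        still_masked = seq.eq(mask_id)
        seq = torch.where(still_masked, pred, seq)           # fill masked slots in parallel
        conf = torch.where(still_masked, conf, torch.ones_like(conf))
        # Cosine schedule: re-mask the least confident tokens for the next iteration.
        n_remask = int(L * math.cos(math.pi / 2 * (t + 1) / steps))
        if n_remask > 0 and t < steps - 1:
            idx = conf.topk(n_remask, dim=-1, largest=False).indices
            seq = seq.scatter(1, idx, mask_id)
    return seq
```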
Large-scale vision-language models (e.g., CLIP) are leveraged by various methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the context of zero-shot detection. In contrast, distillation-based methods are extra-data-free, but they have their limitations. Specifically, existing work creates distillation regions that are biased toward the base categories, which limits the distillation of novel-category information and harms distillation efficiency. Furthermore, directly using the raw features from CLIP for distillation neglects the domain gap between CLIP's training data and the detection datasets, which makes it difficult to learn the mapping from an image region to the vision-language feature space - an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). First, EZSD adapts CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap; second, EZSD uses CLIP to generate distillation proposals containing potential novel instances, to avoid the distillation being overly biased toward the base categories. Finally, EZSD takes advantage of semantic meaning for regression to further improve model performance. As a result, EZSD achieves state-of-the-art performance on the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in the LVIS overall setting with 1/10 of the training time.
https://arxiv.org/abs/2303.12145
The CLIP model has recently been proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), which in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: this https URL.
https://arxiv.org/abs/2303.12112
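Since PAC-S belongs to the CLIP-Score family of contrastive metrics, its scoring step can be sketched as a rescaled cosine similarity between image and candidate-caption embeddings. The snippet below only illustrates that reference-free scoring convention; the positive-augmented training of the backbone, and PAC-S's exact scaling, are not shown and the value of `scale` is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_style_caption_score(image_feat, caption_feat, scale=2.5):
    """image_feat, caption_feat: (D,) embeddings from a (PAC-S-style) dual encoder.
    Returns a clipped, rescaled cosine similarity, as in CLIP-Score-like metrics."""
    sim = F.cosine_similarity(image_feat, caption_feat, dim=0).item()
    return scale * max(sim, 0.0)
```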
Human sketches have already proved their worth in various visual understanding tasks (e.g., retrieval, segmentation, image captioning, etc.). In this paper, we reveal a new trait of sketches - that they are also salient. This is intuitive, as sketching is a natural attentive process at its core. More specifically, we aim to study how sketches can be used as a weak label to detect salient objects present in an image. To this end, we propose a novel method that emphasises how a "salient object" can be explained by hand-drawn sketches. To accomplish this, we introduce a photo-to-sketch generation model that aims to generate sequential sketch coordinates corresponding to a given visual photo through a 2D attention mechanism. Attention maps accumulated across the time steps give rise to salient regions in the process. Extensive quantitative and qualitative experiments prove our hypothesis and delineate how our sketch-based saliency detection model gives a competitive performance compared to the state-of-the-art.
https://arxiv.org/abs/2303.11502
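The saliency signal above comes from accumulating the photo-to-sketch decoder's 2D attention maps over time steps. A trivial sketch of that accumulation step, assuming the per-step attention maps are already available from the generation model (the min-max normalization is an illustrative choice):

```python
import torch

def saliency_from_attention(attn_maps):
    """attn_maps: (T, H, W) attention maps, one per sketch-decoding time step.
    Returns an (H, W) saliency map normalized to [0, 1]."""
    sal = attn_maps.sum(dim=0)          # accumulate attention over time steps
    sal = sal - sal.min()
    return sal / sal.max().clamp_min(1e-8)
```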
Implicit neural representations (INR) have gained significant popularity for signal and image representation in many end tasks, such as super-resolution, 3D modeling, and more. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model's representational power. Higher representational power is needed to go from representing a single given image to representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminates the need for positional encodings. To achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets like ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with far fewer trainable parameters. With far fewer training parameters and higher representative power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. The code is available at \url{this https URL}
https://arxiv.org/abs/2303.11424
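The key operation described above is an element-wise product between hidden features and an affine transform of the pixel coordinates after each ReLU, which raises the polynomial degree of the representation layer by layer. A minimal sketch of one such block under those assumptions (sizes and coordinate normalization are illustrative):

```python
import torch
import torch.nn as nn

class PolyINRBlock(nn.Module):
    """feat <- ReLU(W * feat) * (A * coords + b): one more polynomial degree per block."""
    def __init__(self, dim, coord_dim=2):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.affine = nn.Linear(coord_dim, dim)   # affine-transformed coordinates

    def forward(self, feat, coords):
        # feat: (N, dim) hidden features; coords: (N, 2) pixel coordinates in [-1, 1].
        return torch.relu(self.fc(feat)) * self.affine(coords)

# Stacking k such blocks yields (up to) a degree-k polynomial in the coordinates.
```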
Large Language Models (LLMs) have so far impressed the world with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best performance on challenging benchmarks. With the abundance of such unimodal models, a natural question arises: do we also need to follow this trend to tackle multimodal tasks? In this work, we propose instead to direct effort toward efficient adaptation of existing models, and we propose to augment Language Models with perception. Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency. In particular, they still train a large number of parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP) trained on huge image-text datasets, and add significant inference overhead. In addition, most of these approaches have focused on Zero-Shot and In-Context Learning, with little to no effort on direct finetuning. We investigate the minimal computational effort needed to adapt unimodal models to multimodal tasks and propose a new challenging setup, alongside different approaches, that efficiently adapts unimodal pretrained models. We show that by freezing more than 99\% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning across Image, Video, and Audio modalities, following the proposed setup. The code will be available here: this https URL.
https://arxiv.org/abs/2303.11403
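The recipe summarized above is: freeze almost everything, train a single linear projection from the visual encoder into the LM's embedding space, and prepend one trainable token. A schematic sketch under those assumptions; module names, the pooling of visual features, and the exact injection point are placeholders rather than eP-ALM's actual implementation.

```python
import torch
import torch.nn as nn

class PerceptionAdapter(nn.Module):
    """Trainable pieces only: one linear projection and one prepended soft token."""
    def __init__(self, vis_dim, lm_dim):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, vis_feat, text_embeds):
        # vis_feat: (B, vis_dim) pooled visual feature; text_embeds: (B, L, lm_dim).
        vis = self.proj(vis_feat).unsqueeze(1)                     # (B, 1, lm_dim)
        tok = self.soft_token.expand(text_embeds.size(0), -1, -1)  # (B, 1, lm_dim)
        return torch.cat([tok, vis, text_embeds], dim=1)           # fed to the frozen LM

def freeze_backbones(*modules):
    for m in modules:                  # e.g., the language model and the visual encoder
        for p in m.parameters():
            p.requires_grad = False
```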
Pose transfer, which aims to transfer a given person into a specified posture, has recently attracted considerable attention. A typical pose transfer framework usually employs representative datasets to train a discriminative model, whose assumptions are often violated by out-of-distribution (OOD) instances. Recently, test-time adaptation (TTA) has offered a feasible solution for OOD data by using a pre-trained model that learns essential features with self-supervision. However, those methods implicitly assume that all test distributions share a unified signal that can be learned directly. In open-world conditions, the pose transfer task raises various independent signals - OOD appearance and skeleton - which need to be extracted and handled separately. To address this point, we develop SEquential Test-time Adaptation (SETA). In the test-time phase, SETA extracts and distributes external appearance texture by augmenting OOD data for self-supervised training. To make the non-Euclidean similarity among different postures explicit, SETA uses image representations derived from a person re-identification (Re-ID) model for similarity computation. By addressing implicit posture representation sequentially at test time, SETA greatly improves the generalization performance of current pose transfer models. In our experiments, we first show that pose transfer can be applied to open-world applications, including TikTok reenactment and celebrity motion synthesis.
https://arxiv.org/abs/2303.10945
Despite recent attention and exploration of depth for various tasks, it is still an unexplored modality for weakly-supervised object detection (WSOD). We propose an amplifier method for enhancing the performance of WSOD by integrating depth information. Our approach can be applied to any WSOD method based on multiple-instance learning, without necessitating additional annotations or inducing large computational expenses. Our proposed method employs a monocular depth estimation technique to obtain hallucinated depth information, which is then incorporated into a Siamese WSOD network using contrastive loss and fusion. By analyzing the relationship between language context and depth, we calculate depth priors to identify the bounding box proposals that may contain an object of interest. These depth priors are then utilized to update the list of pseudo ground-truth boxes, or adjust the confidence of per-box predictions. Our proposed method is evaluated on six datasets (COCO, PASCAL VOC, Conceptual Captions, Clipart1k, Watercolor2k, and Comic2k) by implementing it on top of two state-of-the-art WSOD methods, and we demonstrate a substantial enhancement in performance.
https://arxiv.org/abs/2303.10937
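The abstract describes using depth priors derived from language context to adjust per-box confidences. The sketch below illustrates only the rescaling idea with a Gaussian prior around an expected object depth; the actual prior computation and its interaction with the MIL head are more involved, and the Gaussian form and `prior_std` are assumptions.

```python
import torch

def reweight_boxes_by_depth(scores, boxes, depth_map, prior_depth, prior_std=0.15):
    """scores:      (N,) per-box confidences from a MIL-based WSOD head.
    boxes:       (N, 4) integer (x1, y1, x2, y2) boxes.
    depth_map:   (H, W) hallucinated monocular depth, normalized to [0, 1].
    prior_depth: scalar expected depth of the queried object (from language context).
    """
    new_scores = scores.clone()
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        d = depth_map[y1:y2, x1:x2].mean()                       # mean depth inside the box
        weight = torch.exp(-((d - prior_depth) ** 2) / (2 * prior_std ** 2))
        new_scores[i] = scores[i] * weight
    return new_scores
```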
Multifold observations are common for different data modalities; e.g., a 3D shape can be represented by multi-view images, and an image can be described with different captions. Existing cross-modal contrastive representation learning (XM-CLR) methods such as CLIP are not fully suitable for multifold data, as they only consider one positive pair and treat other pairs as negatives when computing the contrastive loss. In this paper, we propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations. MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities for more comprehensive representation learning. The key to MXM-CLR is a novel multifold-aware hybrid loss which considers multiple positive observations when computing the hard and soft relationships for cross-modal data pairs. We conduct quantitative and qualitative comparisons with SOTA baselines for cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also perform extensive evaluations on the adaptability and generalizability of MXM-CLR, as well as ablation studies on the loss design and the effect of batch size. The results show the superiority of MXM-CLR in learning better representations for multifold data. The code is available at this https URL.
https://arxiv.org/abs/2303.10839
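The core departure from CLIP-style training above is that all observations of the same instance count as positives. Below is a minimal multi-positive InfoNCE term as a sketch, assuming instance ids mark which cross-modal observations belong together; MXM-CLR's hard/soft hybrid weighting is richer than this uniform averaging.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive(x_feats, y_feats, x_ids, y_ids, tau=0.07):
    """x_feats: (N, D), y_feats: (M, D) features from two modalities.
    x_ids, y_ids: instance ids; entries sharing an id are multifold observations."""
    x = F.normalize(x_feats, dim=-1)
    y = F.normalize(y_feats, dim=-1)
    logits = x @ y.t() / tau                                   # (N, M)
    pos = (x_ids.unsqueeze(1) == y_ids.unsqueeze(0)).float()   # multi-positive mask
    log_prob = logits.log_softmax(dim=1)
    # Average the log-likelihood over all positives of each anchor.
    return -(pos * log_prob).sum(dim=1).div(pos.sum(dim=1).clamp_min(1)).mean()
```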
Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep feature used in many recent image captioning methods, the well-known bottom-up features provide a detailed representation of the different objects in an image compared with feature maps directly extracted from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning based on fusing the visual-relationship information extracted from an image's scene graph with the spatial feature maps of the image. A multi-modal reward function is then introduced for deep reinforcement learning of the proposed network, using a combination of language and vision similarities in a common embedding space. The results of extensive experimentation on the MSCOCO dataset show the effectiveness of using visual relationships in the proposed captioning method. Moreover, the results clearly indicate that the proposed multi-modal reward in deep reinforcement learning leads to better model optimization, outperforming several state-of-the-art image captioning algorithms while using lightweight and easy-to-extract image features. A detailed experimental study of the components constituting the proposed method is also presented.
https://arxiv.org/abs/2303.10766
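A multi-modal reward that mixes language and vision similarities in a shared embedding space can be sketched as a weighted sum of two cosine similarities. This is only an illustration of the idea; the mixing weight, the encoders producing the embeddings, and whether a reference caption is used are assumptions, not the paper's exact reward.

```python
import torch
import torch.nn.functional as F

def multimodal_reward(gen_cap_emb, ref_cap_emb, image_emb, alpha=0.5):
    """All inputs are (D,) embeddings in a shared space (e.g., from a dual encoder).
    Reward = alpha * sim(generated caption, reference caption)
           + (1 - alpha) * sim(generated caption, image)."""
    lang_sim = F.cosine_similarity(gen_cap_emb, ref_cap_emb, dim=0)
    vis_sim = F.cosine_similarity(gen_cap_emb, image_emb, dim=0)
    return alpha * lang_sim + (1 - alpha) * vis_sim
```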
Large pre-trained vision-language models (VLMs) reduce the time for developing predictive models for various vision-grounded language downstream tasks by providing rich, adaptable image and text representations. However, these models suffer from societal biases owing to the skewed distribution of various identity groups in the training data. These biases manifest as the skewed similarity between the representations for specific text concepts and images of people of different identity groups and, therefore, limit the usefulness of such models in real-world high-stakes applications. In this work, we present DeAR (Debiasing with Additive Residuals), a novel debiasing method that learns additive residual image representations to offset the original representations, ensuring fair output representations. In doing so, it reduces the ability of the representations to distinguish between the different identity groups. Further, we observe that the current fairness tests are performed on limited face image datasets that fail to indicate why a specific text concept should/should not apply to them. To bridge this gap and better evaluate DeAR, we introduce the Protected Attribute Tag Association (PATA) dataset - a new context-based bias benchmarking dataset for evaluating the fairness of large pre-trained VLMs. Additionally, PATA provides visual context for a diverse human population in different scenarios with both positive and negative connotations. Experimental results for fairness and zero-shot performance preservation using multiple datasets demonstrate the efficacy of our framework.
https://arxiv.org/abs/2303.10431
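Structurally, DeAR adds a learned residual to the frozen VLM's image representation. The sketch below shows only that additive-residual module and the debiased output; the fairness objective that trains the residual (and the frozen backbone it attaches to) are not shown, and the hidden size is illustrative.

```python
import torch
import torch.nn as nn

class AdditiveResidualDebiaser(nn.Module):
    """Produces z' = z + r(z); only the residual network r is trained."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, image_repr):
        # image_repr: (B, D) image features from the frozen VLM.
        return image_repr + self.residual(image_repr)
```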
Visual Abductive Reasoning (VAR) is an emerging vision-language (VL) topic where a model needs to retrieve/generate a likely textual hypothesis from a visual input (an image or part of an image) using backward reasoning based on prior knowledge or commonsense. Unlike in conventional VL retrieval or captioning tasks, where the entities mentioned in the text appear in the image, in abductive inference the relevant facts are not directly visible in the input images. Moreover, the inferences are causally tied to regional visual hints and vary with them. Existing works highlight visual parts from the global background with specific prompt-tuning techniques (e.g., colorful prompt tuning) on top of foundation models such as CLIP. However, these methods uniformly patchify "regional hints" and "global context" at the same granularity level and may lose fine-grained visual details significant for abductive reasoning. To tackle this, we propose a simple yet effective Regional Prompt Tuning, which encodes "regional visual hints" and "global contexts" separately at fine and coarse-grained levels. Specifically, our model explicitly upsamples and then patchifies local hints to obtain fine-grained regional prompts. These prompts are concatenated with coarse-grained contextual tokens from whole images. We also equip our model with a new Dual-Contrastive Loss that regresses the visual feature simultaneously toward features of the factual description (a.k.a. clue text) and the plausible hypothesis (abductive inference text) during training. Extensive experiments on the Sherlock dataset demonstrate that our fully fine-tuned RGP/RGPs with Dual-Contrastive Loss significantly outperform previous SOTAs, achieving rank 1 on the abductive reasoning leaderboard among all submissions under all metrics (e.g., P@1$_{i->t}$: RGPs 38.78 vs CPT-CLIP 33.44, higher is better). We will open-source our code for further research.
https://arxiv.org/abs/2303.10428
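The regional-prompt construction described above crops the local hint, upsamples it, patchifies it into fine-grained tokens, and concatenates those with coarse tokens from the whole image. A rough sketch of that tokenization step using a shared convolutional patch embedding; the patch size, resize resolution, and shared tokenizer are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def regional_prompts(image, box, patch_embed, fine_size=224, patch_dim_note="see docstring"):
    """image:       (1, 3, H, W) input image tensor.
    box:         (x1, y1, x2, y2) integer region of interest.
    patch_embed: e.g., nn.Conv2d(3, D, kernel_size=32, stride=32), a shared tokenizer.
    Returns fine-grained regional tokens concatenated with coarse whole-image tokens."""
    x1, y1, x2, y2 = box
    region = image[:, :, y1:y2, x1:x2]
    region = F.interpolate(region, size=(fine_size, fine_size),
                           mode="bilinear", align_corners=False)   # explicit upsampling
    fine_tokens = patch_embed(region).flatten(2).transpose(1, 2)    # (1, N_fine, D)
    coarse_tokens = patch_embed(
        F.interpolate(image, size=(fine_size, fine_size), mode="bilinear",
                      align_corners=False)).flatten(2).transpose(1, 2)
    return torch.cat([fine_tokens, coarse_tokens], dim=1)
```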
Vision-language pretraining that learns a fine-grained, region-word alignment between image-caption pairs has propelled progress in open-vocabulary object detection. We observe that region-word alignment methods are typically used in detection with respect to object nouns only, and the impact of other rich context in captions, such as attributes, is unclear. In this study, we explore how language context affects downstream object detection and propose to enhance the role of context. In particular, we show how to strategically contextualize the grounding pretraining objective for improved alignment. We further home in on attributes as especially useful object context and propose a novel adjective- and noun-based negative sampling strategy to increase their focus in contrastive learning. Overall, our methods enhance object detection when compared to the state of the art in region-word pretraining. We also highlight the fine-grained utility of an attribute-sensitive model through text-region retrieval and phrase-grounding analysis.
https://arxiv.org/abs/2303.10093
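Adjective- and noun-based negative sampling can be pictured as swapping an attribute or object word in a caption to form a hard negative for contrastive pretraining. Below is a toy sketch under the assumption that part-of-speech tags are already available (a real pipeline would obtain them from a POS tagger such as spaCy); the replacement pools and swap rule are illustrative, not the paper's exact strategy.

```python
import random

def attribute_noun_negatives(tokens, pos_tags, adj_pool, noun_pool, n_neg=2, seed=0):
    """tokens:   list of caption words, e.g. ["a", "red", "car", "on", "the", "road"].
    pos_tags: matching coarse tags, e.g. ["DET", "ADJ", "NOUN", "ADP", "DET", "NOUN"].
    adj_pool / noun_pool: vocabularies to sample replacement words from.
    Returns captions where one adjective or noun has been swapped (hard negatives)."""
    rng = random.Random(seed)
    swap_idx = [i for i, t in enumerate(pos_tags) if t in ("ADJ", "NOUN")]
    negatives = []
    for _ in range(n_neg):
        i = rng.choice(swap_idx)
        pool = adj_pool if pos_tags[i] == "ADJ" else noun_pool
        repl = rng.choice([w for w in pool if w != tokens[i]])
        negatives.append(" ".join(tokens[:i] + [repl] + tokens[i + 1:]))
    return negatives
```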