The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective (e.g., CLIP), an image-to-text generative objective (e.g., PaLI), or a text-to-image generative objective (e.g., Parti). However, the three objectives can be pre-trained on the same data (image-text pairs), and intuitively they complement each other: contrasting provides global alignment capacity, while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios: for instance, 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot captioning.
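As a rough illustration of the unicoder idea (a sketch, not the authors' implementation), the block below can switch between bidirectional encoding and causal decoding with a single flag; the module name, dimensions, and layout are illustrative assumptions.

```python
# Minimal sketch of a "unicoder" block that can switch between bidirectional
# encoding and causal decoding; names and sizes are illustrative.
import torch
import torch.nn as nn


class UnicoderBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, causal: bool = False) -> torch.Tensor:
        # causal=True -> decoding mode (each token attends only to its past);
        # causal=False -> encoding mode (full bidirectional attention).
        mask = None
        if causal:
            seq_len = x.size(1)
            mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                         device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x


# The same weights serve both roles: contrastive and text-to-image paths can
# call the block with causal=False, image-to-text generation with causal=True.
block = UnicoderBlock()
tokens = torch.randn(2, 16, 512)
encoded = block(tokens, causal=False)
decoded = block(tokens, causal=True)
```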
https://arxiv.org/abs/2303.13455
Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have utilized datasets with a large vocabulary that contains a large number of object classes, under the assumption that such data will enable models to extract comprehensive knowledge on the relationships between various objects and better generalize to unseen object classes. In this study, we argue that more fine-grained labels are necessary to extract richer knowledge about novel objects, including object attributes and relationships, in addition to their names. To address this challenge, we propose a simple and effective method named Pseudo Caption Labeling (PCL), which utilizes an image captioning model to generate captions that describe object instances from diverse perspectives. The resulting pseudo caption labels offer dense samples for knowledge distillation. On the LVIS benchmark, our best model trained on the de-duplicated VisualGenome dataset achieves an AP of 34.5 and an APr of 30.6, comparable to the state-of-the-art performance. PCL's simplicity and flexibility are other notable features, as it is a straightforward pre-processing technique that can be used with any image captioning model without imposing any restrictions on model architecture or training process.
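A minimal sketch of how pseudo caption labeling could work as a pre-processing step is shown below; `caption_model` is a placeholder for any off-the-shelf image captioning model, and the data layout is an assumption.

```python
# Sketch of Pseudo Caption Labeling as a pre-processing step: caption each
# annotated object crop and attach the captions as extra text labels for
# knowledge distillation.
from PIL import Image


def pseudo_caption_labels(image_path, boxes, caption_model, num_captions=3):
    """boxes: list of (x0, y0, x1, y1) object boxes in pixel coordinates."""
    image = Image.open(image_path).convert("RGB")
    labels = []
    for box in boxes:
        crop = image.crop(box)
        # Sample several captions so each instance is described from diverse
        # perspectives (names, attributes, relationships).
        captions = [caption_model(crop) for _ in range(num_captions)]
        labels.append({"box": box, "pseudo_captions": captions})
    return labels
```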
https://arxiv.org/abs/2303.13040
While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data with a single model, rather than generating one fixed modality conditioned on the other. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. In particular, we propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding via parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. To train MAGVLT rigorously from scratch on image-text pairs, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on step-unrolled mask prediction and selective prediction on a mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that MAGVLT outperforms ARGVLT by a large margin even with a significant inference speedup. In particular, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation on MS-COCO with a single moderate-sized model (fewer than 500M parameters), even without the use of monomodal data and networks.
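The non-autoregressive decoding described above can be illustrated with a MaskGIT-style iterative mask-predict loop; the following is a hedged sketch in which the model interface, masking schedule, and greedy filling are assumptions rather than the paper's exact procedure.

```python
# Sketch of iterative parallel decoding by mask prediction: predict all masked
# tokens at once, keep the most confident ones, and re-mask the rest.
# `model(tokens)` is assumed to return per-position logits over the vocabulary.
import math
import torch


@torch.no_grad()
def mask_predict_decode(model, seq_len, mask_id, num_steps=8):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        logits = model(tokens)                      # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)  # (1, seq_len)
        still_masked = tokens.eq(mask_id)
        tokens = torch.where(still_masked, prediction, tokens)
        # Cosine schedule: the number of tokens kept masked shrinks each step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if keep_masked <= 0:
            break
        # Re-mask the least confident of the newly filled positions; tokens
        # committed in earlier steps get infinite confidence and stay fixed.
        confidence = confidence.masked_fill(~still_masked, float("inf"))
        remask = confidence.topk(keep_masked, dim=-1, largest=False).indices
        tokens.scatter_(1, remask, mask_id)
    return tokens
```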
https://arxiv.org/abs/2303.12208
The CLIP model has recently been proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S), which in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: this https URL.
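For context, a reference-free CLIP-style caption score reduces to a rescaled cosine similarity between image and text embeddings, as in the sketch below; PAC-S additionally fine-tunes the encoders with positive-augmented contrastive learning, which is not shown here, and the weight `w` is an assumption.

```python
# Sketch of a reference-free, embedding-based captioning score in the spirit
# of CLIP-Score, computed from a dual-encoder model's embeddings.
import torch
import torch.nn.functional as F


def embedding_caption_score(image_emb: torch.Tensor, text_emb: torch.Tensor,
                            w: float = 2.0) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of images and candidate captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    cosine = (image_emb * text_emb).sum(dim=-1)
    # Clamp at zero and rescale, as commonly done for CLIP-based caption metrics.
    return w * cosine.clamp(min=0)
```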
https://arxiv.org/abs/2303.12112
Human sketch has already proved its worth in various visual understanding tasks (e.g., retrieval, segmentation, image captioning, etc.). In this paper, we reveal a new trait of sketches - that they are also salient. This is intuitive, as sketching is at its core a natural attentive process. More specifically, we aim to study how sketches can be used as a weak label to detect salient objects present in an image. To this end, we propose a novel method that emphasises how a "salient object" can be explained by hand-drawn sketches. To accomplish this, we introduce a photo-to-sketch generation model that generates sequential sketch coordinates corresponding to a given visual photo through a 2D attention mechanism. Attention maps accumulated across the time steps give rise to salient regions in the process. Extensive quantitative and qualitative experiments prove our hypothesis and delineate how our sketch-based saliency detection model gives a competitive performance compared to the state-of-the-art.
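The saliency extraction step can be summarized by the small sketch below: 2D attention maps produced at each decoding step are accumulated and normalized into a saliency map. The tensor shapes are assumptions.

```python
# Sketch of turning per-step 2D attention maps from a photo-to-sketch decoder
# into a saliency map: accumulate the maps over time steps and normalize.
import torch


def saliency_from_attention(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (num_steps, H, W) attention over image locations per decoding step."""
    saliency = attn_maps.sum(dim=0)
    saliency = saliency - saliency.min()
    return saliency / (saliency.max() + 1e-8)
```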
https://arxiv.org/abs/2303.11502
Implicit neural representations (INR) have gained significant popularity for signal and image representation in many end tasks, such as super-resolution, 3D modeling, and more. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model's representational power; higher representational power is needed to go from representing a single given image to representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminating the need for positional encodings. To achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets like ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with far fewer trainable parameters. With much fewer training parameters and higher representative power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. The code is available at \url{this https URL}
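A minimal sketch of a Poly-INR-style block, assuming coordinates in [-1, 1]^2 and illustrative layer sizes: after every ReLU, the features are multiplied element-wise with an affine transform of the pixel coordinates, so each block raises the effective polynomial degree by one.

```python
# Sketch of a polynomial INR: no positional encoding; element-wise products
# with affine-transformed coordinates after every ReLU layer.
import torch
import torch.nn as nn


class PolyINRBlock(nn.Module):
    def __init__(self, dim: int, coord_dim: int = 2):
        super().__init__()
        self.feature = nn.Linear(dim, dim)
        self.affine = nn.Linear(coord_dim, dim)   # affine-transformed coordinates

    def forward(self, h: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.feature(h))
        return h * self.affine(coords)            # element-wise multiplication


class PolyINR(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 6, out_dim: int = 3):
        super().__init__()
        self.stem = nn.Linear(2, dim)
        self.blocks = nn.ModuleList(PolyINRBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, out_dim)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        h = self.stem(coords)
        for block in self.blocks:
            h = block(h, coords)
        return self.head(h)


# coords in [-1, 1]^2, one row per pixel; output is the predicted RGB value.
rgb = PolyINR()(torch.rand(1024, 2) * 2 - 1)
```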
https://arxiv.org/abs/2303.11424
Pose transfer, which aims to transfer a given person into a specified posture, has recently attracted considerable attention. A typical pose transfer framework usually employs representative datasets to train a discriminative model, an assumption that is often violated by out-of-distribution (OOD) instances. Recently, test-time adaptation (TTA) has offered a feasible solution for OOD data by using a pre-trained model that learns essential features with self-supervision. However, those methods implicitly assume that all test distributions share a unified signal that can be learned directly. In open-world conditions, the pose transfer task raises various independent signals, OOD appearance and skeleton, which need to be extracted and handled separately. To address this point, we develop SEquential Test-time Adaptation (SETA). In the test-time phase, SETA extracts and distributes external appearance texture by augmenting OOD data for self-supervised training. To make the non-Euclidean similarity among different postures explicit, SETA uses image representations derived from a person re-identification (Re-ID) model for similarity computation. By sequentially addressing implicit posture representation at test time, SETA greatly improves the generalization performance of current pose transfer models. In our experiments, we first show that pose transfer can be applied to open-world applications, including TikTok reenactment and celebrity motion synthesis.
https://arxiv.org/abs/2303.10945
Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep feature used in many recent image captioning methods, the well-known bottom-up features provide a detailed representation of the different objects in an image compared with feature maps extracted directly from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning based on fusing the visual relationship information extracted from an image's scene graph with the spatial feature maps of the image. A multi-modal reward function is then introduced for deep reinforcement learning of the proposed network, using a combination of language and vision similarities in a common embedding space. The results of extensive experimentation on the MSCOCO dataset show the effectiveness of using visual relationships in the proposed captioning method. Moreover, the results clearly indicate that the proposed multi-modal reward in deep reinforcement learning leads to better model optimization, outperforming several state-of-the-art image captioning algorithms while using lightweight, easy-to-extract image features. A detailed experimental study of the components constituting the proposed method is also presented.
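A hedged sketch of such a multi-modal reward is given below: it mixes a language similarity (generated vs. reference caption) with a vision similarity (generated caption vs. image) in a shared embedding space. The encoders and the mixing weight alpha are assumptions, not the paper's exact formulation.

```python
# Sketch of a multi-modal reward for reinforcement learning of a captioner,
# combining language and vision similarities in a common embedding space.
import torch
import torch.nn.functional as F


def multimodal_reward(gen_text_emb, ref_text_emb, image_emb, alpha=0.5):
    """All inputs: (batch, dim) embeddings in a shared visual-semantic space."""
    lang_sim = F.cosine_similarity(gen_text_emb, ref_text_emb, dim=-1)
    vis_sim = F.cosine_similarity(gen_text_emb, image_emb, dim=-1)
    return alpha * lang_sim + (1.0 - alpha) * vis_sim
```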
https://arxiv.org/abs/2303.10766
Large pre-trained vision-language models (VLMs) reduce the time for developing predictive models for various vision-grounded language downstream tasks by providing rich, adaptable image and text representations. However, these models suffer from societal biases owing to the skewed distribution of various identity groups in the training data. These biases manifest as the skewed similarity between the representations for specific text concepts and images of people of different identity groups and, therefore, limit the usefulness of such models in real-world high-stakes applications. In this work, we present DeAR (Debiasing with Additive Residuals), a novel debiasing method that learns additive residual image representations to offset the original representations, ensuring fair output representations. In doing so, it reduces the ability of the representations to distinguish between the different identity groups. Further, we observe that the current fairness tests are performed on limited face image datasets that fail to indicate why a specific text concept should/should not apply to them. To bridge this gap and better evaluate DeAR, we introduce the Protected Attribute Tag Association (PATA) dataset - a new context-based bias benchmarking dataset for evaluating the fairness of large pre-trained VLMs. Additionally, PATA provides visual context for a diverse human population in different scenarios with both positive and negative connotations. Experimental results for fairness and zero-shot performance preservation using multiple datasets demonstrate the efficacy of our framework.
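The additive-residual idea can be sketched as follows (an illustration under assumed dimensions, not the released DeAR code): a small residual network produces an offset that is added to the frozen VLM image representation, and it is trained so that protected attributes can no longer be predicted from the output.

```python
# Sketch of debiasing with additive residuals: the frozen representation plus a
# learned offset, optimized so identity groups become indistinguishable.
import torch
import torch.nn as nn


class AdditiveResidualDebiaser(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, image_repr: torch.Tensor) -> torch.Tensor:
        # Original representation plus a learned additive residual offset.
        return image_repr + self.residual(image_repr)


# During training, the residual is optimized so that a protected-attribute
# classifier fails on the debiased output while task utility is preserved.
debiased = AdditiveResidualDebiaser()(torch.randn(4, 512))
```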
https://arxiv.org/abs/2303.10431
Vision-language pretraining to learn a fine-grained, region-word alignment between image-caption pairs has propelled progress in open-vocabulary object detection. We observe that region-word alignment methods are typically used in detection with respect to only object nouns, and the impact of other rich context in captions, such as attributes, is unclear. In this study, we explore how language context affects downstream object detection and propose to enhance the role of context. In particular, we show how to strategically contextualize the grounding pretraining objective for improved alignment. We further hone in on attributes as especially useful object context and propose a novel adjective and noun-based negative sampling strategy for increasing their focus in contrastive learning. Overall, our methods enhance object detection when compared to the state-of-the-art in region-word pretraining. We also highlight the fine-grained utility of an attribute-sensitive model through text-region retrieval and phrase grounding analysis.
https://arxiv.org/abs/2303.10093
Cytopathology report generation is a necessary step for the standardized examination of pathology images. However, manually writing detailed reports brings heavy workloads for pathologists. To improve efficiency, some existing works have studied automatic generation of cytopathology reports, mainly by applying image caption generation frameworks with visual encoders originally proposed for natural images. A common weakness of these works is that they do not explicitly model the structural information among cells, which is a key feature of pathology images and provides significant information for making diagnoses. In this paper, we propose a novel graph-based framework called GNNFormer, which seamlessly integrates graph neural network (GNN) and Transformer into the same framework, for cytopathology report generation. To the best of our knowledge, GNNFormer is the first report generation method that explicitly models the structural information among cells in pathology images. It also effectively fuses structural information among cells, fine-grained morphology features of cells and background features to generate high-quality reports. Experimental results on the NMI-WSI dataset show that GNNFormer can outperform other state-of-the-art baselines.
https://arxiv.org/abs/2303.09956
Distance-based classification is frequently used in transductive few-shot learning (FSL). However, due to the high dimensionality of image representations, FSL classifiers are prone to suffer from the hubness problem, where a few points (hubs) occur frequently in multiple nearest-neighbour lists of other points. Hubness negatively impacts distance-based classification when hubs from one class appear often among the nearest neighbors of points from another class, degrading the classifier's performance. To address the hubness problem in FSL, we first prove that hubness can be eliminated by distributing representations uniformly on the hypersphere. We then propose two new approaches to embed representations on the hypersphere, which we prove optimize a tradeoff between uniformity and local similarity preservation -- reducing hubness while retaining class structure. Our experiments show that the proposed methods reduce hubness and significantly improve transductive FSL accuracy for a wide range of classifiers.
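A sketch of the two ingredients implied above, with assumed loss forms and weighting (not the paper's exact objectives): a uniformity term that spreads L2-normalized embeddings over the hypersphere, which removes hubs, and a term that preserves the local similarity structure of the original features.

```python
# Sketch of hyperspherical embedding losses: uniformity on the sphere plus
# preservation of pairwise similarities from the original feature space.
import torch
import torch.nn.functional as F


def uniformity_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """z: (n, d) embeddings; lower values mean more uniform on the hypersphere."""
    z = F.normalize(z, dim=-1)
    sq_dists = F.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()


def local_similarity_loss(z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Keep pairwise cosine similarities of embeddings z close to those of the
    original representations x."""
    zs = F.normalize(z, dim=-1) @ F.normalize(z, dim=-1).T
    xs = F.normalize(x, dim=-1) @ F.normalize(x, dim=-1).T
    return F.mse_loss(zs, xs)


def embedding_loss(z, x, lam=0.1):
    return local_similarity_loss(z, x) + lam * uniformity_loss(z)
```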
https://arxiv.org/abs/2303.09352
Self-supervised learning has recently emerged as a strong alternative in document analysis. These approaches are now capable of learning high-quality image representations and overcoming the limitations of supervised methods, which require a large amount of labeled data. However, these methods are unable to capture new knowledge in an incremental fashion, where data is presented to the model sequentially, which is closer to the realistic scenario. In this paper, we explore the potential of continual self-supervised learning to alleviate the catastrophic forgetting problem in handwritten text recognition, as an example of sequence recognition. Our method consists of adding intermediate layers, called adapters, for each task and efficiently distilling knowledge from the previous model while learning the current task. Our proposed framework is efficient in both computation and memory complexity. To demonstrate its effectiveness, we evaluate our method by transferring the learned model to diverse text recognition downstream tasks, including Latin and non-Latin scripts. As far as we know, this is the first application of continual self-supervised learning for handwritten text recognition. We attain state-of-the-art performance on English, Italian and Russian scripts, whilst adding only a few parameters per task. The code and trained models will be publicly available.
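The two components can be sketched as follows, with the bottleneck size and distillation loss form as assumptions: a per-task adapter with a residual connection, and a feature-distillation term against the frozen model from the previous task.

```python
# Sketch of a per-task adapter and a knowledge-distillation loss for continual
# self-supervised learning.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection, added for each new task."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(F.relu(self.down(h)))


def distillation_loss(current_feats: torch.Tensor,
                      previous_feats: torch.Tensor) -> torch.Tensor:
    """previous_feats come from the frozen model of the previous task."""
    return F.mse_loss(current_feats, previous_feats.detach())
```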
https://arxiv.org/abs/2303.09347
Diffusion models have shown remarkable success in visual synthesis, but have also raised concerns about potential abuse for malicious purposes. In this paper, we seek to build a detector for telling apart real images from diffusion-generated images. We find that existing detectors struggle to detect images generated by diffusion models, even if we include generated images from a specific diffusion model in their training data. To address this issue, we propose a novel image representation called DIffusion Reconstruction Error (DIRE), which measures the error between an input image and its reconstruction by a pre-trained diffusion model. We observe that diffusion-generated images can be approximately reconstructed by a diffusion model while real images cannot, which suggests that DIRE can serve as a bridge to distinguish generated from real images. DIRE provides an effective way to detect images generated by most diffusion models, generalizes to images from unseen diffusion models, and is robust to various perturbations. Furthermore, we establish a comprehensive diffusion-generated benchmark including images generated by eight diffusion models to evaluate the performance of diffusion-generated image detectors. Extensive experiments on our collected benchmark demonstrate that DIRE exhibits superiority over previous generated-image detectors. The code and dataset are available at this https URL.
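A minimal sketch of computing DIRE, assuming a `reconstruct` routine that performs inversion and denoising with a pre-trained diffusion model (a placeholder, not a real library call):

```python
# Sketch of the DIRE representation: per-pixel error between an input image and
# its diffusion-based reconstruction.
import torch


def compute_dire(image: torch.Tensor, reconstruct) -> torch.Tensor:
    """image: (C, H, W) in [0, 1]; returns a DIRE map of the same shape."""
    with torch.no_grad():
        recon = reconstruct(image)      # e.g., DDIM inversion + denoising
    return (image - recon).abs()


# A binary real/fake classifier (any standard image backbone) is then trained
# on DIRE maps instead of raw pixels; generated images yield small errors,
# real images yield larger ones.
```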
https://arxiv.org/abs/2303.09295
Aiming to improve Automatic Speech Recognition (ASR) outputs with a post-processing step, ASR error correction (EC) techniques have been widely developed due to their efficiency in using parallel text data. Previous works mainly focus on using text and/or speech data, which hinders the performance gain when not only text and speech information but also other modalities, such as visual information, are critical for EC. The challenges are mainly twofold: first, previous work fails to emphasize visual information, so this direction has rarely been explored; second, the community lacks a high-quality benchmark where visual information matters for EC models. Therefore, this paper provides 1) simple yet effective methods, namely gated fusion and image captions as prompts, to incorporate visual information to help EC; 2) a large-scale benchmark dataset, namely Visual-ASR-EC, where each item in the training data consists of visual, speech, and text information, and the test data are carefully selected by human annotators to ensure that even humans could make mistakes when visual information is missing. Experimental results show that using captions as prompts can effectively use the visual information and surpass state-of-the-art methods by up to 1.2% in Word Error Rate (WER), which also indicates that visual information is critical in our proposed Visual-ASR-EC dataset.
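A hedged sketch of the gated-fusion variant is shown below; the feature dimensions are illustrative and the module is not the paper's released implementation.

```python
# Sketch of gated fusion for incorporating visual information into an ASR
# error-correction model: a learned gate decides, per dimension, how much of
# the visual feature to mix into the text feature.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, text_dim: int = 768, visual_dim: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        v = self.visual_proj(visual_feat)
        g = torch.sigmoid(self.gate(torch.cat([text_feat, v], dim=-1)))
        return g * text_feat + (1.0 - g) * v


# The captions-as-prompts alternative described above simply prepends an image
# caption to the ASR hypothesis before feeding it to the correction model.
fused = GatedFusion()(torch.randn(2, 768), torch.randn(2, 512))
```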
https://arxiv.org/abs/2303.10160
Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning. This paper proposes the Perturbation Robust Multi-Lingual CLIPScore (PR-MCS), which exhibits robustness to such perturbations, as a novel reference-free image captioning metric applicable to multiple languages. To achieve perturbation robustness, we fine-tune the text encoder of CLIP with our language-agnostic method to distinguish perturbed text from the original text. To verify the robustness of PR-MCS, we introduce a new fine-grained evaluation dataset consisting of detailed captions, critical objects, and the relationships between the objects for 3,000 images in five languages. In our experiments, PR-MCS significantly outperforms baseline metrics in capturing lexical noise across all perturbation types in all five languages, proving that PR-MCS is highly robust to lexical perturbations.
https://arxiv.org/abs/2303.08389
Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 world cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: this http URL
https://arxiv.org/abs/2303.07274
Foundation models trained on large-scale datasets have recently surged in CV and NLP. In contrast, development in the biomedical domain lags far behind due to data scarcity. To address this issue, we build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMed Central's Open Access subset, which is 8 times larger than before. PMC-OA covers diverse modalities and diseases, with the majority of the image-caption samples aligned at a finer-grained level, i.e., subfigure and subcaption. By pretraining a CLIP-style model on PMC-OA, our model, named PMC-CLIP, achieves state-of-the-art results on various downstream tasks, including image-text retrieval on ROCO, MedMNIST image classification, and Medical VQA, e.g., +8.1% R@10 on image-text retrieval and +3.9% accuracy on image classification.
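For reference, the CLIP-style pretraining objective on image-caption pairs is the symmetric contrastive loss sketched below; the temperature and batch layout are assumptions.

```python
# Sketch of the symmetric image-text contrastive (CLIP-style) objective used
# for pretraining on paired image-caption data.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim); the i-th image and caption form a pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```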
https://arxiv.org/abs/2303.07240
Over the last years, advancements in deep learning models for computer vision have led to a dramatic improvement in their image classification accuracy. However, models with a higher accuracy in the task they were trained on do not necessarily develop better image representations that allow them to also perform better in other tasks they were not trained on. In order to investigate the representation learning capabilities of prominent high-performing computer vision models, we investigated how well they capture various indices of perceptual similarity from large-scale behavioral datasets. We find that higher image classification accuracy rates are not associated with a better performance on these datasets, and in fact we observe no improvement in performance since GoogLeNet (released 2015) and VGG-M (released 2014). We speculate that more accurate classification may result from hyper-engineering towards very fine-grained distinctions between highly similar classes, which does not incentivize the models to capture overall perceptual similarities.
https://arxiv.org/abs/2303.07084
Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches simultaneously predict the two structures by two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, the previous methods struggle with imprecise bounding boxes as the logical representation lacks local visual information. To address this issue, we propose an end-to-end sequential modeling framework for table structure recognition called VAST. It contains a novel coordinate sequence decoder triggered by the representation of the non-empty cell from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, where the left, top, right and bottom coordinates are decoded sequentially to leverage the inter-coordinate dependency. Furthermore, we propose an auxiliary visual-alignment loss to enforce the logical representation of the non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. Extensive experiments demonstrate that our proposed method can achieve state-of-the-art results in both logical and physical structure recognition. The ablation study also validates that the proposed coordinate sequence decoder and the visual-alignment loss are the keys to the success of our method.
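The coordinate-sequence idea can be illustrated by a simple quantization scheme, sketched below under an assumed number of bins: each of the left, top, right, and bottom coordinates becomes one discrete token that the decoder predicts in order.

```python
# Sketch of treating bounding-box coordinates as a token sequence: quantize
# (left, top, right, bottom) into discrete bins and decode them one at a time.
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """box: (left, top, right, bottom) in pixels -> four integer tokens."""
    left, top, right, bottom = box
    return [
        round(left / image_w * (num_bins - 1)),
        round(top / image_h * (num_bins - 1)),
        round(right / image_w * (num_bins - 1)),
        round(bottom / image_h * (num_bins - 1)),
    ]


def tokens_to_box(tokens, image_w, image_h, num_bins=1000):
    left, top, right, bottom = tokens
    return (left / (num_bins - 1) * image_w, top / (num_bins - 1) * image_h,
            right / (num_bins - 1) * image_w, bottom / (num_bins - 1) * image_h)
```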
https://arxiv.org/abs/2303.06949