The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
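As a rough illustration of the second idea, here is a minimal PyTorch-style sketch (my own, not the authors' code) of a trainable gated shortcut: a zero-initialized gate makes the added branch an identity mapping at the start of detection training, so the pretrained vision-text alignment is preserved initially and only gradually modified.

```python
import torch
import torch.nn as nn

class GatedShortcutBlock(nn.Module):
    """Residual block whose learned branch is gated by a zero-initialized scalar."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Gate starts at zero -> output == input at initialization.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.conv(x)
```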
https://arxiv.org/abs/2303.13518
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
https://arxiv.org/abs/2303.13496
Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, to generalize to different data distributions and tasks with unseen semantic forms, and to ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), which are key to disambiguating objects in complex 3D scenes. The modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
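To make the high-arity idea concrete, below is a toy sketch (hand-written geometry, not NS3D's learned neural modules) of a ternary "between" module that scores each candidate object against two anchor objects — the kind of relation among more than two objects that purely binary relation modules cannot express.

```python
import torch

def between_score(candidates: torch.Tensor, anchor_a: torch.Tensor, anchor_b: torch.Tensor) -> torch.Tensor:
    # candidates: (n, 3) object centers; anchor_a, anchor_b: (3,) anchor centers
    ab = anchor_b - anchor_a
    t = ((candidates - anchor_a) @ ab) / (ab @ ab)          # projection onto the anchor segment
    closest = anchor_a + t.clamp(0, 1).unsqueeze(-1) * ab   # nearest point on the segment
    dist = (candidates - closest).norm(dim=-1)
    return torch.softmax(-dist, dim=0)                      # distribution over candidate objects
```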
https://arxiv.org/abs/2303.13483
The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective like CLIP, an image-to-text generative objective like PaLI, or a text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other, as contrasting provides global alignment capacity while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios: for instance, 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot captioning.
https://arxiv.org/abs/2303.13455
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts by a large margin (24.8%) - a great testimony to studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over the previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
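The patch-shuffling idea can be sketched roughly as follows (the grid size and implementation details are assumptions, not the paper's exact recipe): both images of a sketch-photo pair are split into a grid of patches and permuted with the same random order, so matching has to rely on patch-level structural correspondence.

```python
import torch

def paired_patch_shuffle(sketch: torch.Tensor, photo: torch.Tensor, grid: int = 4):
    # sketch, photo: (C, H, W) tensors with H and W divisible by `grid`
    C, H, W = sketch.shape
    ph, pw = H // grid, W // grid

    def to_patches(img):
        return (img.reshape(C, grid, ph, grid, pw)
                   .permute(1, 3, 0, 2, 4)
                   .reshape(grid * grid, C, ph, pw))

    def from_patches(patches):
        return (patches.reshape(grid, grid, C, ph, pw)
                       .permute(2, 0, 3, 1, 4)
                       .reshape(C, H, W))

    perm = torch.randperm(grid * grid)   # one permutation, shared by the pair
    return from_patches(to_patches(sketch)[perm]), from_patches(to_patches(photo)[perm])
```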
https://arxiv.org/abs/2303.13440
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: this https URL .
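A minimal sketch of the cross-frame attention modification (my own simplification of the idea, not the released implementation): each frame keeps its own queries, but keys and values are taken from the first frame, which anchors the appearance and identity of the foreground object across frames.

```python
import torch

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (frames, tokens, dim) attention projections of the frame latents
    k0 = k[:1].expand_as(k)   # keys of frame 0, broadcast to every frame
    v0 = v[:1].expand_as(v)   # values of frame 0
    attn = torch.softmax(q @ k0.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0
```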
https://arxiv.org/abs/2303.13439
Automated diagnosis prediction from medical images is a valuable resource to support clinical decision-making. However, such systems usually need to be trained on large amounts of annotated data, which often is scarce in the medical domain. Zero-shot methods address this challenge by allowing a flexible adaption to new settings with different clinical findings without relying on labeled data. Further, to integrate automated diagnosis in the clinical workflow, methods should be transparent and explainable, increasing medical professionals' trust and facilitating correctness verification. In this work, we introduce Xplainer, a novel framework for explainable zero-shot diagnosis in the clinical setting. Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task. Specifically, instead of directly predicting a diagnosis, we prompt the model to classify the existence of descriptive observations, which a radiologist would look for on an X-Ray scan, and use the descriptor probabilities to estimate the likelihood of a diagnosis. Our model is explainable by design, as the final diagnosis prediction is directly based on the prediction of the underlying descriptors. We evaluate Xplainer on two chest X-ray datasets, CheXpert and ChestX-ray14, and demonstrate its effectiveness in improving the performance and explainability of zero-shot diagnosis. Our results suggest that Xplainer provides a more detailed understanding of the decision-making process and can be a valuable tool for clinical diagnosis.
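A minimal sketch of the classification-by-description step, assuming CLIP-style L2-normalized embeddings (the descriptor prompts, the temperature, and the averaging rule are illustrative assumptions, not Xplainer's exact choices): the diagnosis score is derived from descriptor probabilities rather than predicted directly, which is what makes the prediction explainable.

```python
import torch

def diagnosis_probability(image_emb: torch.Tensor, descriptor_embs: torch.Tensor, temperature: float = 100.0):
    # image_emb: (d,) X-ray embedding; descriptor_embs: (n_descriptors, d) text embeddings
    logits = temperature * descriptor_embs @ image_emb   # (n_descriptors,)
    descriptor_probs = torch.sigmoid(logits)             # P(descriptor present in the scan)
    return descriptor_probs.mean(), descriptor_probs     # diagnosis score + per-descriptor explanation
```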
https://arxiv.org/abs/2303.13391
Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation which enables data augmentation for self-finetuning and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
https://arxiv.org/abs/2303.13386
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from long-tail distribution issues: tail predicates are more costly to train and harder to distinguish because they have far fewer annotations than frequent predicates. Existing re-balancing strategies try to handle this via prior rules but are still confined to pre-defined conditions, which are not scalable across various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.
https://arxiv.org/abs/2303.13233
Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. The few training samples restrict the performance of FSOD model. Recent text-to-image generation models have shown promising results in generating high-quality images. How applicable these synthetic images are for FSOD tasks remains under-explored. This work extensively studies how synthetic images generated from state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two perspectives: (1) How to use synthetic data for FSOD? (2) How to find representative samples from the large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, saliency object detection is applied to the original generated image, and the minimum enclosing box is used for cropping the main object based on the saliency map. After that, the cropped object is randomly pasted on the image, which comes from the base dataset. We also study the influence of the input text of text-to-image generator and the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and cluster-based method. However, the severe problem of high false positives (FP) ratio of novel categories in FSOD can not be solved by using synthetic data. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter 90% of FP by defining a threshold for the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, in which performance gain is up to 21.9% compared to the few-shot baseline.
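The CLIP-based false-positive filter can be sketched as follows (the threshold value and the prompt embeddings are assumptions): a detection for a novel category is kept only when the CLIP similarity between the cropped box and the text of its predicted category passes a threshold.

```python
import torch

def filter_false_positives(box_embs: torch.Tensor, class_text_embs: torch.Tensor,
                           pred_labels: torch.Tensor, threshold: float = 0.25) -> torch.Tensor:
    # box_embs: (n_boxes, d) CLIP image embeddings of cropped detections (L2-normalized)
    # class_text_embs: (n_classes, d) CLIP text embeddings of category prompts (L2-normalized)
    # pred_labels: (n_boxes,) long tensor of predicted class indices
    sims = (box_embs * class_text_embs[pred_labels]).sum(dim=-1)   # cosine similarity per box
    return sims >= threshold                                       # boolean keep mask
```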
https://arxiv.org/abs/2303.13221
In Task Oriented Dialogue (TOD) systems, detecting and inducing new intents are two main challenges in applying the system to the real world. In this paper, we suggest a semantic multi-view model to resolve these two challenges: (1) SBERT for General Embedding (GE), (2) Multi Domain Batch (MDB) for dialogue domain knowledge, and (3) Proxy Gradient Transfer (PGT) for cluster-specialized semantics. MDB feeds diverse dialogue datasets to the model at once to tackle the multi-domain problem by learning multiple domains' knowledge. We introduce a novel method, PGT, which employs a Siamese network to fine-tune the model directly with a clustering method. Using PGT, our model can learn how to cluster dialogue utterances. Experimental results demonstrate that our multi-view model with MDB and PGT significantly improves Open Intent Induction performance compared to baseline systems.
https://arxiv.org/abs/2303.13099
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
https://arxiv.org/abs/2303.13043
Salient Span Masking (SSM) has shown itself to be an effective strategy to improve closed-book question answering performance. SSM extends general masked language model pretraining by creating additional unsupervised training sentences that mask a single entity or date span, thus oversampling factual information. Despite the success of this paradigm, the span types and sampling strategies are relatively arbitrary and not widely studied for other tasks. Thus, we investigate SSM from the perspective of temporal tasks, where learning a good representation of various temporal expressions is important. To that end, we introduce Temporal Span Masking (TSM) intermediate training. First, we find that SSM alone improves the downstream performance on three temporal tasks by an avg. +5.8 points. Further, we are able to achieve additional improvements (avg. +0.29 points) by adding the TSM task. These comprise the new best reported results on the targeted tasks. Our analysis suggests that the effectiveness of SSM stems from the sentences chosen in the training data rather than the mask choice: sentences with entities frequently also contain temporal expressions. Nonetheless, the additional targeted spans of TSM can still improve performance, especially in a zero-shot context.
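A rough sketch of how a Temporal Span Masking training example might be constructed (the regex is a crude stand-in for a proper temporal tagger, and the T5-style mask token is an assumption): one temporal expression per sentence is replaced by the mask token to build the intermediate training data.

```python
import re

TEMPORAL = re.compile(
    r"\b(\d{4}|\d{1,2}\s+(January|February|March|April|May|June|July|"
    r"August|September|October|November|December)(\s+\d{4})?)\b")

def temporal_span_mask(sentence: str, mask_token: str = "<extra_id_0>"):
    match = TEMPORAL.search(sentence)
    if match is None:
        return None   # no temporal span found -> skip this sentence
    return sentence[:match.start()] + mask_token + sentence[match.end():]

# temporal_span_mask("The treaty was signed on 12 June 1987 in Berlin.")
# -> "The treaty was signed on <extra_id_0> in Berlin."
```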
https://arxiv.org/abs/2303.12860
The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. However, it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between the retrieval context and the intended completion target. We also propose a new benchmark, RepoEval, which consists of the latest high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.
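The iterative retrieval-generation loop can be sketched as follows (retrieve and generate are placeholders for any similarity-based retriever and pretrained code LM, and the prompt format is an assumption): the previous iteration's completion is appended to the retrieval query, so the retrieved context moves closer to the intended completion target.

```python
def repo_level_complete(unfinished_code: str, retrieve, generate, iterations: int = 2) -> str:
    # retrieve(query) -> list of similar code snippets from other files in the repository
    # generate(prompt) -> completion string produced by a pretrained code language model
    completion = ""
    for _ in range(iterations):
        query = unfinished_code + completion                 # bridge retrieval context and target
        snippets = retrieve(query)
        prompt = "\n".join(snippets) + "\n" + unfinished_code
        completion = generate(prompt)
    return completion
```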
https://arxiv.org/abs/2303.12570
Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, which boosts the state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present the optional ensemble scheme.
https://arxiv.org/abs/2303.12417
The electrocardiogram (ECG) is one of the most commonly used non-invasive, convenient medical monitoring tools that assist in the clinical diagnosis of heart diseases. Recently, deep learning (DL) techniques, particularly self-supervised learning (SSL), have demonstrated great potential in the classification of ECG. SSL pre-training has achieved competitive performance with only a small amount of annotated data after fine-tuning. However, current SSL methods rely on the availability of annotated data and are unable to predict labels not existing in fine-tuning datasets. To address this challenge, we propose Multimodal ECG-Text Self-supervised pre-training (METS), the first work to utilize auto-generated clinical reports to guide ECG SSL pre-training. We use a trainable ECG encoder and a frozen language model to embed paired ECGs and automatically machine-generated clinical reports separately. The SSL objective aims to maximize the similarity between a paired ECG and its auto-generated report while minimizing the similarity between the ECG and other reports. In downstream classification tasks, METS achieves around 10% improvement in performance without using any annotated data via zero-shot classification, compared to other supervised and SSL baselines that rely on annotated data. Furthermore, METS achieves the highest recall and F1 scores on the MIT-BIH dataset, despite MIT-BIH containing different classes of ECG compared to the pre-training dataset. The extensive experiments have demonstrated the advantages of ECG-text multimodal self-supervised learning in terms of generalizability, effectiveness, and efficiency.
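The pre-training objective described above is essentially a paired contrastive loss; a minimal InfoNCE-style sketch (the exact formulation used by METS may differ) looks like this:

```python
import torch
import torch.nn.functional as F

def ecg_report_contrastive_loss(ecg_embs: torch.Tensor, report_embs: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # ecg_embs, report_embs: (batch, d); row i of each side is a matched ECG-report pair
    ecg_embs = F.normalize(ecg_embs, dim=-1)
    report_embs = F.normalize(report_embs, dim=-1)
    logits = ecg_embs @ report_embs.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(ecg_embs.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                  # pull diagonal pairs together, push others apart
```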
https://arxiv.org/abs/2303.12311
We present a cascaded diffusion model based on a part-level implicit 3D representation. Our model achieves state-of-the-art generation quality and also enables part-level shape editing and manipulation without any additional training in conditional setup. Diffusion models have demonstrated impressive capabilities in data generation as well as zero-shot completion and editing via a guided reverse process. Recent research on 3D diffusion models has focused on improving their generation capabilities with various data representations, while the absence of structural information has limited their capability in completion and editing tasks. We thus propose our novel diffusion model using a part-level implicit representation. To effectively learn diffusion with high-dimensional embedding vectors of parts, we propose a cascaded framework, learning diffusion first on a low-dimensional subspace encoding extrinsic parameters of parts and then on the other high-dimensional subspace encoding intrinsic attributes. In the experiments, we demonstrate the outperformance of our method compared with the previous ones both in generation and part-level completion and manipulation tasks.
https://arxiv.org/abs/2303.12236
While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data by a single model rather than a generation of one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Especially, we propose a generative VL transformer based on the non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO by one moderate-sized model (fewer than 500M parameters) even without the use of monomodal data and networks.
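For intuition, here is a minimal sketch of non-autoregressive mask-predict decoding with iterative refinement (the re-masking schedule is a generic assumption, not MAGVLT's exact one): all positions start masked, every step predicts all tokens in parallel, and the least confident predictions are re-masked for the next step.

```python
import torch

@torch.no_grad()
def mask_predict_decode(model, length: int, mask_id: int, steps: int = 8) -> torch.Tensor:
    # model(tokens) is assumed to return logits of shape (1, length, vocab)
    tokens = torch.full((1, length), mask_id)
    for step in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)        # per-position confidence and argmax token
        tokens = preds
        n_mask = int(length * (1 - (step + 1) / steps))  # fewer masked positions each iteration
        if n_mask > 0:
            lowest = probs.topk(n_mask, largest=False).indices
            tokens[0, lowest[0]] = mask_id               # re-mask the least confident predictions
    return tokens
```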
https://arxiv.org/abs/2303.12208
Backdoor attacks inject poisoned data into the training set, resulting in misclassification of the poisoned samples during model inference. Defending against such attacks is challenging, especially in real-world black-box settings where only model predictions are available. In this paper, we propose a novel backdoor defense framework that can effectively defend against various attacks through zero-shot image purification (ZIP). Our proposed framework can be applied to black-box models without requiring any internal information about the poisoned model or any prior knowledge of the clean/poisoned samples. Our defense framework involves a two-step process. First, we apply a linear transformation on the poisoned image to destroy the trigger pattern. Then, we use a pre-trained diffusion model to recover the missing semantic information removed by the transformation. In particular, we design a new reverse process using the transformed image to guide the generation of high-fidelity purified images, which can be applied in zero-shot settings. We evaluate our ZIP backdoor defense framework on multiple datasets with different kinds of attacks. Experimental results demonstrate the superiority of our ZIP framework compared to state-of-the-art backdoor defense baselines. We believe that our results will provide valuable insights for future defense methods for black-box models.
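A minimal sketch of the two-step purification (the specific linear transform and the diffusion call are placeholders, not the paper's exact design): a linear degradation destroys the trigger pattern, and a pretrained diffusion model's guided reverse process restores the semantics that the degradation removed.

```python
import torch.nn.functional as F

def zero_shot_purify(image, diffusion_restore, scale: float = 0.25):
    # image: (1, C, H, W) possibly poisoned input
    # diffusion_restore: callable running the guided reverse process of a pretrained diffusion model
    _, _, h, w = image.shape
    degraded = F.interpolate(image, scale_factor=scale, mode="bilinear")   # linear transform breaks the trigger
    degraded = F.interpolate(degraded, size=(h, w), mode="bilinear")       # back to original resolution
    return diffusion_restore(degraded)                                     # recover semantic content
```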
https://arxiv.org/abs/2303.12175
The large-scale vision-language models (e.g., CLIP) are leveraged by different methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the context of zero-shot detection. In contrast, the distillation-based method is an extra-data-free method, but it has its limitations. Specifically, existing work creates distillation regions that are biased to the base categories, which limits the distillation of novel category information and harms the distillation efficiency. Furthermore, directly using the raw feature from CLIP for distillation neglects the domain gap between the training data of CLIP and the detection datasets, which makes it difficult to learn the mapping from the image region to the vision-language feature space - an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). Firstly, EZSD adapts the CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap; Secondly, EZSD uses CLIP to generate distillation proposals with potential novel instances, to avoid the distillation being overly biased to the base categories. Finally, EZSD takes advantage of semantic meaning for regression to further improve the model performance. As a result, EZSD achieves state-of-the-art performance in the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in LVIS overall setting with 1/10 training time.
https://arxiv.org/abs/2303.12145