Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
https://arxiv.org/abs/2309.12530
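To make the regularization concrete, here is a minimal sketch of how a student image encoder could be pulled toward a CLIP teacher's text embeddings, with one absolute-distance and one relative-distance variant. The exact loss formulations, weights, and helper names (`rise_absolute_loss`, `rise_relative_loss`) are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rise_absolute_loss(student_img_emb, teacher_txt_emb):
    """Pull each student image embedding toward the CLIP teacher's text
    embedding of the matching caption (absolute-distance design, assumed L2)."""
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_txt_emb, dim=-1)
    return ((s - t) ** 2).sum(dim=-1).mean()

def rise_relative_loss(student_img_emb, teacher_txt_emb):
    """Match the student's similarity structure across the batch to the
    teacher's (relative-distance design, assumed as pairwise similarities)."""
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_txt_emb, dim=-1)
    sim_s = s @ s.t()          # student image-image similarities
    sim_t = t @ t.t()          # teacher text-text similarities
    return F.mse_loss(sim_s, sim_t)

# Hypothetical usage inside a training step:
# loss = task_loss + lambda_abs * rise_absolute_loss(img_emb, txt_emb) \
#                  + lambda_rel * rise_relative_loss(img_emb, txt_emb)
```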
Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance lags significantly behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information that is unnecessary for making decisions, which harms the representation learning goal. Further investigation is necessary to stabilize the forward-forward strategy for self-supervised learning and to make it work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
https://arxiv.org/abs/2309.11955
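For readers unfamiliar with the forward-forward algorithm, the sketch below shows a single layer trained with a local "goodness" objective on positive and negative passes, with no gradient flowing between layers. It follows Hinton's published description in spirit; the threshold, optimizer, and class name are illustrative assumptions, not the benchmark code used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One forward-forward layer: trained with a local 'goodness' objective
    instead of backpropagated gradients (sketch after Hinton, 2022)."""
    def __init__(self, dim_in, dim_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so only its direction is passed to this layer.
        return F.relu(self.linear(F.normalize(x, dim=-1)))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).mean(dim=-1)   # goodness of positive pass
        g_neg = self.forward(x_neg).pow(2).mean(dim=-1)   # goodness of negative pass
        # Push positive goodness above the threshold, negative goodness below it.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # Detach outputs so no gradient flows to earlier layers (no backpropagation).
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```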
Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically editing parts of a given image based on specified texts. For these two similar tasks, the key point is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, while struggling to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise corresponding to these two different tasks, and under the condition of the specific texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images at resolutions of up to $1024\times1024$, the highest currently achievable for these tasks. Extensive experiments on the Multi-modal CelebA-HQ dataset have demonstrated that our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
https://arxiv.org/abs/2309.11923
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, due to the unsolved adversarial robustness problem of vision models, introducing vision inputs can expose MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a chatbot competitive with ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on their transferability. We show that the adversarial examples can also attack other MLLMs, e.g., a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE Bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is available at this https URL.
https://arxiv.org/abs/2309.11751
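The transfer-based attack can be illustrated with a simple PGD-style loop that perturbs an image so that a white-box surrogate vision encoder's embedding drifts away from the clean embedding; the hope is that the perturbation also misleads a black-box MLLM. This is a generic sketch under standard L_inf settings, not the authors' exact objective or surrogate ensemble.

```python
import torch
import torch.nn.functional as F

def feature_deviation_attack(encoder, image, eps=8/255, alpha=1/255, steps=20):
    """PGD on a white-box surrogate vision encoder: push the adversarial
    embedding away from the clean one and rely on transferability."""
    with torch.no_grad():
        clean_feat = encoder(image).flatten(1)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feat = encoder((image + delta).clamp(0, 1)).flatten(1)
        sim = F.cosine_similarity(adv_feat, clean_feat).mean()
        sim.backward()                               # we want similarity to go down
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()       # descend on similarity
            delta.clamp_(-eps, eps)                  # stay within the L_inf budget
            delta.grad = None
    return (image + delta).clamp(0, 1).detach()
```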
Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly, without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful on ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark, however, largely because of the difficulty of modeling context dependence.
https://arxiv.org/abs/2309.11710
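As background for the kind of referenceless metric ContextRef evaluates, the snippet below computes a CLIPScore-style rating, i.e. a rescaled, clipped cosine similarity between CLIP image and text embeddings. The checkpoint name and the 2.5 rescaling follow the commonly used CLIPScore recipe and are not part of ContextRef itself.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clipscore(image, caption, model_name="openai/clip-vit-base-patch32"):
    """Referenceless CLIPScore-style metric: 2.5 * max(cos(image, text), 0)."""
    model = CLIPModel.from_pretrained(model_name)
    proc = CLIPProcessor.from_pretrained(model_name)
    inputs = proc(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return 2.5 * max(cos, 0.0)
```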
As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose information. The methods then only need a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose information. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor from an adversarial attack perspective. Specifically, we design an adversarial strategy focused on generating natural appearance changes of the subject, against which we could expect a disentangled network to be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed to evaluate the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.
https://arxiv.org/abs/2309.11667
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
https://arxiv.org/abs/2309.11419
Offline handwriting recognition (HWR) has improved significantly with the advent of deep learning architectures in recent years. Nevertheless, it remains a challenging problem and practical applications often rely on post-processing techniques for restricting the predicted words via lexicons or language models. Despite their enhanced performance, such systems are less usable in contexts where out-of-vocabulary words are anticipated, e.g. for detecting misspelled words in school assessments. To that end, we introduce the task of comparing a handwriting image to text. To solve the problem, we propose an unrestricted binary classifier, consisting of a HWR feature extractor and a multimodal classification head which convolves the feature extractor output with the vector representation of the input text. Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network. We demonstrate that, while maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task by directly using state-of-the-art HWR models. Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
https://arxiv.org/abs/2309.10158
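The classification head described above pairs an HWR feature sequence with a text embedding. One plausible reading of "convolving" the two, sketched below, is to turn the text vector into a per-sample dynamic 1-D filter that is slid over the handwriting features; the module and dimension names are hypothetical, and the paper's head may be wired differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatchHead(nn.Module):
    """Scores whether a handwriting feature sequence matches a text string by
    using the text embedding as a dynamic 1-D convolution filter (sketch)."""
    def __init__(self, feat_dim, text_dim, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.text_to_filter = nn.Linear(text_dim, feat_dim * kernel_size)

    def forward(self, hwr_feats, text_vec):
        # hwr_feats: (B, C, T) features from the HWR backbone; text_vec: (B, D)
        B, C, T = hwr_feats.shape
        filt = self.text_to_filter(text_vec).view(B, C, self.kernel_size)
        # Grouped conv applies each sample's text-derived filter to its own features.
        resp = F.conv1d(hwr_feats.reshape(1, B * C, T), filt,
                        groups=B, padding=self.kernel_size // 2)   # (1, B, T)
        return torch.sigmoid(resp.view(B, T).mean(dim=1))          # match probability
```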
Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes SynthAC, a framework that leverages recent advances in audio generative models and commonly available text corpora to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, a text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals from captions taken from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpora from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
https://arxiv.org/abs/2309.09705
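The data-generation step can be reproduced in a few lines with the public AudioLDM checkpoint in the diffusers library: image-captioning sentences are used as prompts, and the resulting waveforms become synthetic text-audio pairs. The checkpoint name, step count, and clip length below are assumptions, not the paper's settings.

```python
# pip install diffusers transformers scipy torch
import scipy.io.wavfile as wavfile
from diffusers import AudioLDMPipeline

# Captions that would normally come from an image-captioning dataset.
captions = ["a dog barking in the distance", "rain falling on a tin roof"]

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")  # assumed public checkpoint
for i, caption in enumerate(captions):
    audio = pipe(caption, num_inference_steps=50, audio_length_in_s=5.0).audios[0]
    wavfile.write(f"synthetic_{i}.wav", rate=16000, data=audio)   # AudioLDM outputs 16 kHz audio
    # Each (caption, wav) pair becomes a synthetic text-audio training example.
```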
Text-based Person Retrieval (TBPR) aims to retrieve the target person images given a textual query. The primary challenge lies in bridging the substantial gap between the vision and language modalities, especially when dealing with limited large-scale datasets. In this paper, we introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for TBPR. Specifically, to exploit CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed from text-to-image and image-to-text bidirectional prompts and coupling projections. Secondly, Dual Adapters Transferring (DAT) is designed to transfer knowledge on the output side of the Multi-Head Self-Attention (MHSA) blocks in both vision and language branches. This synergistic two-way collaborative mechanism promotes early-stage feature fusion and efficiently exploits the existing knowledge of CLIP. CSKT outperforms the state-of-the-art approaches across three benchmark datasets while its training parameters account for merely 7.4% of the entire model, demonstrating its remarkable efficiency, effectiveness and generalization.
https://arxiv.org/abs/2309.09496
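The DAT side of the method can be pictured as small bottleneck adapters attached to the outputs of CLIP's frozen self-attention blocks, so that only a small fraction of parameters is trained. The sketch below shows one such adapter; the bottleneck width, scaling, and zero-initialization are generic parameter-efficient-tuning choices, not CSKT's exact design.

```python
import torch.nn as nn

class MHSAAdapter(nn.Module):
    """Lightweight bottleneck adapter placed on the output of a frozen
    Multi-Head Self-Attention block (sketch; CSKT's dual-adapter wiring and
    prompt coupling are richer than this)."""
    def __init__(self, dim, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale
        # Zero-init so training starts from the frozen CLIP behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, mhsa_out):
        # Residual adapter: frozen MHSA output + small trainable correction.
        return mhsa_out + self.scale * self.up(self.act(self.down(mhsa_out)))
```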
Video summarization remains a huge challenge in computer vision due to the size of the input videos to be summarized. We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency. Using only textual captions obtained via a zero-shot approach, we train a language transformer model and forgo image representations. This method allows us to perform filtration among the representative text vectors and condense the sequence. With our approach, we gain explainability with natural language, which comes easily for human interpretation, along with textual summaries of the videos. An ablation study that focuses on modality and data compression shows that leveraging the text modality alone effectively reduces input data processing while retaining comparable results.
https://arxiv.org/abs/2309.09405
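The core idea, summarizing a video purely from its frame captions, can be illustrated with a small unsupervised selection routine: embed the captions with a sentence encoder and greedily keep the ones that best cover the rest. The paper instead trains a language transformer for this filtration, so treat the model name and the coverage heuristic below as stand-ins.

```python
# pip install sentence-transformers
import torch
from sentence_transformers import SentenceTransformer, util

def summarize_captions(captions, k=5, model_name="all-MiniLM-L6-v2"):
    """Language-only summarization sketch: embed per-frame captions and keep k
    representative ones by greedily maximizing coverage of the caption set."""
    model = SentenceTransformer(model_name)
    emb = model.encode(captions, convert_to_tensor=True, normalize_embeddings=True)
    sim = util.cos_sim(emb, emb)                        # caption-to-caption similarity
    covered = torch.zeros(sim.size(0), device=sim.device)
    selected = []
    for _ in range(min(k, len(captions))):
        gain = (sim - covered).clamp(min=0).sum(dim=1)  # marginal coverage gain
        if selected:
            gain[selected] = -1.0                       # never re-pick a caption
        idx = int(gain.argmax())
        selected.append(idx)
        covered = torch.maximum(covered, sim[idx])
    return [captions[i] for i in sorted(selected)]      # keep temporal order
```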
Generative modeling of 3D LiDAR data is an emerging task with promising applications for autonomous mobile robots, such as scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR point clouds. Existing approaches have shown the feasibility of image-based LiDAR data generation using deep generative models, while still struggling with the fidelity of generated data and training instability. In this work, we present R2DM, a novel generative model for LiDAR data that can generate diverse and high-fidelity 3D scene point clouds based on the image representation of range and reflectance intensity. Our method is based on denoising diffusion probabilistic models (DDPMs), which have demonstrated impressive results among generative model frameworks and have progressed significantly in recent years. To effectively train DDPMs on the LiDAR domain, we first conduct an in-depth analysis regarding data representation, training objective, and spatial inductive bias. Based on our designed model R2DM, we also introduce a flexible LiDAR completion pipeline using the powerful properties of DDPMs. We demonstrate that our method outperforms the baselines on the generation task of the KITTI-360 and KITTI-Raw datasets and the upsampling task of the KITTI-360 dataset. Our code and pre-trained weights will be available at this https URL.
https://arxiv.org/abs/2309.09256
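Since R2DM builds on DDPMs over a 2-channel range/reflectance image, a minimal training step looks like standard noise-prediction diffusion, as sketched below with diffusers components. The U-Net configuration and hyperparameters are placeholders; the paper's specific data representation and inductive biases are what drive its results.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

# A LiDAR scan is treated as a 2-channel image: range and reflectance intensity.
model = UNet2DModel(in_channels=2, out_channels=2)      # placeholder architecture
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(range_intensity):                        # (B, 2, H, W), scaled to [-1, 1]
    noise = torch.randn_like(range_intensity)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (range_intensity.shape[0],))
    noisy = scheduler.add_noise(range_intensity, noise, t)
    pred = model(noisy, t).sample                       # predict the added noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```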
Since annotating medical images for segmentation tasks commonly incurs high costs, it is highly desirable to design an annotation-efficient method to alleviate the annotation burden. Recently, contrastive learning has exhibited great potential in learning robust representations to boost downstream tasks with limited labels. In medical imaging scenarios, ready-made meta labels (i.e., specific attribute information of medical images) inherently reveal semantic relationships among images, which have been used to define positive pairs in previous work. However, the multi-perspective semantics revealed by various meta labels are usually incompatible and can incur intractable "semantic contradiction" when combining different meta labels. In this paper, we tackle the issue of "semantic contradiction" in a gradient-guided manner using our proposed Gradient Mitigator method, which systematically unifies multi-perspective meta labels to enable a pre-trained model to attain better high-level semantic recognition ability. Moreover, we emphasize that fine-grained discrimination ability is vital for segmentation-oriented pre-training, and develop a novel method called Gradient Filter to dynamically screen the pixel pairs with the most discriminating power based on the magnitude of gradients. Comprehensive experiments on four medical image segmentation datasets verify that our new method GCL: (1) learns informative image representations and considerably boosts segmentation performance with limited labels, and (2) shows promising generalizability on out-of-distribution datasets.
https://arxiv.org/abs/2309.08888
Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only the cross-modal transformation, neglecting intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text-oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments comparing CLSR with some state-of-the-art methods on two audio-text datasets validate the superiority of CLSR.
https://arxiv.org/abs/2309.08839
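The ingredients named in the abstract can be combined into a single training objective, sketched below: cross-modal InfoNCE with a learnable (adaptive) temperature, a simple intra-modal separability term, and a latent-reconstruction term. The specific forms of the auxiliary terms and their weights are illustrative assumptions, not CLSR's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSRStyleLoss(nn.Module):
    """Simplified sketch of a CLSR-style objective for audio-text retrieval."""
    def __init__(self, dim, init_logit_scale=2.66):
        super().__init__()
        self.logit_scale = nn.Parameter(torch.tensor(init_logit_scale))  # adaptive temperature
        self.rec_audio = nn.Linear(dim, dim)   # reconstruct audio latent from text latent
        self.rec_text = nn.Linear(dim, dim)    # reconstruct text latent from audio latent

    def forward(self, audio_emb, text_emb):
        a = F.normalize(audio_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = self.logit_scale.exp() * a @ t.t()
        labels = torch.arange(a.size(0), device=a.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels))
        # Intra-modal separability: discourage different audio clips from collapsing.
        intra = (a @ a.t() - torch.eye(a.size(0), device=a.device)).pow(2).mean()
        # Latent representation reconstruction across modalities.
        rec = F.mse_loss(self.rec_audio(t), a.detach()) + \
              F.mse_loss(self.rec_text(a), t.detach())
        return contrastive + 0.1 * intra + 0.1 * rec
```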
Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.
https://arxiv.org/abs/2309.08585
Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.
https://arxiv.org/abs/2309.08637
Large Language Models (LLMs), primarily trained on text-based datasets, exhibit exceptional proficiency in understanding and executing complex linguistic instructions via text outputs. However, they falter when asked to generate non-text outputs. Concurrently, modality conversion models, such as text-to-image models, despite generating high-quality images, suffer from a lack of extensive textual pretraining. As a result, these models are only capable of accommodating specific image descriptions rather than comprehending more complex instructions. To bridge this gap, we propose a novel approach, \methodname, from a modality conversion perspective that evolves a text-based LLM into a multi-modal one. We specifically employ a minimal dataset to instruct LLMs to recognize the intended output modality as directed by the instructions. Consequently, the adapted LLM can effectively summon various off-the-shelf modality conversion models from the model zoos to generate non-text responses. This circumvents the necessity for complicated pretraining that typically requires immense quantities of paired multi-modal data, while simultaneously inheriting the extensive knowledge of LLMs and the ability of high-quality generative models. To evaluate and compare the adapted multi-modal LLM with its traditional counterparts, we have constructed a multi-modal instruction benchmark that solicits diverse modality outputs. The experiment results reveal that, with minimal training, LLMs can be conveniently adapted to comprehend requests for non-text responses, thus achieving higher flexibility in multi-modal scenarios. Code and data will be made available at this https URL.
https://arxiv.org/abs/2309.07623
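The routing idea can be caricatured in a few lines: the adapted LLM first names the intended output modality, and anything non-textual is delegated to an off-the-shelf converter from the model zoo. Both `llm_generate` and the Stable Diffusion checkpoint below are hypothetical stand-ins for whichever instruction-tuned LLM and conversion models are actually used.

```python
from diffusers import StableDiffusionPipeline

def route_request(instruction, llm_generate):
    """Sketch of the modality-conversion idea: a lightly tuned LLM tags the
    intended output modality, then an off-the-shelf converter produces the
    non-text response. `llm_generate` is a hypothetical callable wrapping an
    instruction-tuned LLM."""
    tag = llm_generate(
        "Reply with one word (text/image/audio): which modality should the "
        f"response to this request be?\nRequest: {instruction}"
    ).strip().lower()
    if tag == "image":
        # Assumed public text-to-image checkpoint from the "model zoo".
        pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
        return pipe(instruction).images[0]
    return llm_generate(instruction)              # default: answer in text
```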
Object search is a challenging task because, when given complex language descriptions (e.g., "find the white cup on the table"), the robot must move its camera through the environment and recognize the described object. Previous works map language descriptions to a set of fixed object detectors with predetermined noise models, but these approaches are challenging to scale because new detectors need to be made for each object. In this work, we bridge the gap in realistic object search by posing the search problem as a partially observable Markov decision process (POMDP) where the object detector and the visual sensor noise in the observation model are determined by a single Deep Neural Network conditioned on complex language descriptions. We incorporate the neural network's outputs into our language-conditioned observation model (LCOM) to represent dynamically changing sensor noise. With an LCOM, any language description of an object can be used to generate an appropriate object detector and noise model, and training an LCOM only requires readily available supervised image-caption datasets. We empirically evaluate our method by comparing against a state-of-the-art object search algorithm in simulation, and demonstrate that planning with our observation model yields a significantly higher average task completion rate (from 0.46 to 0.66) and more efficient and quicker object search than with a fixed-noise model. We demonstrate our method on a Boston Dynamics Spot robot, enabling it to handle complex natural language object descriptions and efficiently find objects in a room-scale environment.
https://arxiv.org/abs/2309.07276
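A language-conditioned observation model can be sketched as a small network that maps image features and a description embedding to a detection outcome plus the sensor-noise parameters used in the POMDP belief update, as below. The three-way output and the simple Bayes update are illustrative assumptions; the paper's LCOM and planner are more involved.

```python
import torch
import torch.nn as nn

class LanguageConditionedObservationModel(nn.Module):
    """Sketch of an LCOM-style head: predict whether the described object is
    detected and the sensor noise (true/false-positive rates) for planning."""
    def __init__(self, img_dim, txt_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))   # [detection logit, TP-rate logit, FP-rate logit]

    def forward(self, img_feat, txt_feat):
        out = self.net(torch.cat([img_feat, txt_feat], dim=-1))
        detected = torch.sigmoid(out[..., 0])
        tp_rate = torch.sigmoid(out[..., 1])
        fp_rate = torch.sigmoid(out[..., 2])
        return detected, tp_rate, fp_rate

def belief_update(prior, detected, tp_rate, fp_rate):
    """Bayes update of P(object at this location) after one noisy detection,
    with `detected` treated as a boolean observation (e.g. detected > 0.5)."""
    like_present = tp_rate if detected else 1 - tp_rate
    like_absent = fp_rate if detected else 1 - fp_rate
    return like_present * prior / (like_present * prior + like_absent * (1 - prior))
```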
We introduce differentiable indirection -- a novel learned primitive that employs differentiable multi-scale lookup tables as an effective substitute for traditional compute and data operations across the graphics pipeline. We demonstrate its flexibility on a number of graphics tasks, i.e., geometric and image representation, texture mapping, shading, and radiance field representation. In all cases, differentiable indirection seamlessly integrates into existing architectures, trains rapidly, and yields both versatile and efficient results.
https://arxiv.org/abs/2309.08387
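At its core, differentiable indirection needs a lookup table whose reads are differentiable with respect to both the stored values and the query coordinates. The sketch below implements such a 2-D table with bilinear interpolation; the resolution, channel count, and single-level setup are simplifications of the paper's multi-scale, chained tables.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableLUT2D(nn.Module):
    """A learnable 2-D lookup table read with bilinear interpolation, so both
    the table entries and the (possibly learned) query coordinates receive
    gradients. Basic building block of differentiable indirection (sketch)."""
    def __init__(self, resolution=64, channels=3):
        super().__init__()
        self.table = nn.Parameter(torch.zeros(1, channels, resolution, resolution))

    def forward(self, uv):
        # uv: (B, N, 2) query coordinates in [0, 1]; returns (B, N, channels).
        grid = uv.unsqueeze(1) * 2 - 1                  # to [-1, 1] for grid_sample
        out = F.grid_sample(self.table.expand(uv.size(0), -1, -1, -1),
                            grid, mode="bilinear", align_corners=True)
        return out.squeeze(2).transpose(1, 2)
```

Chaining two such tables, where the first table's output is interpreted as coordinates into the second, gives the indirection from which the paper builds its graphics primitives.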
We propose a new paradigm to automatically generate training data with accurate labels at scale using text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation. To generate foreground objects, we employ a straightforward textual template, incorporating the object class name as the input prompt. This is fed into a text-to-image synthesis framework, producing various foreground images set against isolated backgrounds. A foreground-background segmentation algorithm is then used to generate foreground object masks. To generate context images, we begin by creating language descriptions of the context. This is achieved by applying an image captioning method to a small set of images representing the desired context. These textual descriptions are then transformed into a diverse array of context images via a text-to-image synthesis framework. Subsequently, we composite these with the foreground object masks produced in the initial step, utilizing a cut-and-paste method, to formulate the training data. We demonstrate the advantages of our approach on five object detection and segmentation datasets, including Pascal VOC and COCO. We found that detectors trained solely on synthetic data produced by our method achieve performance comparable to those trained on real data (Fig. 1). Moreover, a combination of real and synthetic data yields even better results. Further analysis indicates that the synthetic data distribution complements the real data distribution effectively. Additionally, we emphasize the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios. We open-source our code at this https URL
https://arxiv.org/abs/2309.05956
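The final compositing step of the pipeline is a plain cut-and-paste, as sketched below with PIL: a generated foreground, its segmentation mask, and a generated context image are combined, and the paste box becomes the detection label. File names, placement, and the single-object case are simplifications of the described approach.

```python
from PIL import Image

def composite(foreground_path, mask_path, background_path, position=(0, 0)):
    """Cut-and-paste step: paste a generated foreground object onto a generated
    context image using its segmentation mask; the paste box doubles as the
    detection label. (Object/context generation and mask extraction happen
    upstream, as described in the abstract.)"""
    fg = Image.open(foreground_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")          # white = object, black = background
    bg = Image.open(background_path).convert("RGB")
    bg.paste(fg, box=position, mask=mask)
    x, y = position
    bbox = (x, y, x + fg.width, y + fg.height)         # bounding-box label for detection
    return bg, bbox
```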