Pre-training image representations from raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models such as CLIP produce state-of-the-art zero-shot results that are often competitive with fully supervised methods without requiring task-specific training. Beyond encouraging classification accuracy, these models are reported to close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural distribution shifts, 3 synthetic distribution shifts, and 11 adversarial attacks, using CLIP as a pilot study. We show that CLIP suffers a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts can be attributed, at least in part, to data overlap. In summary, our results show that a comprehensive evaluation of robustness is necessary and that there is a significant need to improve the robustness of zero-shot multimodal models.
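As an illustration of the kind of zero-shot evaluation such a benchmark runs, here is a minimal sketch using the open-source CLIP package, assuming a shifted test set arranged as an ImageFolder; the dataset path and prompt template are placeholders, not the paper's exact protocol.

```python
import torch
import clip  # openai/CLIP
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# A shifted test set (e.g. a sketch/rendition variant of ImageNet) laid out as class folders.
dataset = ImageFolder("/path/to/shifted_test_set", transform=preprocess)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

# Build zero-shot classifier weights from class-name prompts.
prompts = [f"a photo of a {name}" for name in dataset.classes]
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        img_feat = model.encode_image(images.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feat.T).argmax(dim=-1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()

print(f"zero-shot top-1 accuracy under shift: {correct / total:.3f}")
```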
https://arxiv.org/abs/2403.10499
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective metrics and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
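Below is a schematic sketch of the three-stage cascade described above (mel-spectrogram to mono audio, bandwidth extension, mono-to-stereo upmix). The module internals are placeholders rather than the paper's GAN architectures; the mid/side construction in the last stage only illustrates why the output stays downmix-compatible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelToAudio(nn.Module):
    """Stage 1: low-rate vocoding (placeholder internals)."""
    def forward(self, mel):                      # mel: (batch, n_mels, frames)
        return torch.zeros(mel.size(0), 1, mel.size(-1) * 256)   # (batch, 1, samples), mono

class BandwidthExtension(nn.Module):
    """Stage 2: low sampling rate -> high sampling rate (stand-in for the learned BWE module)."""
    def forward(self, audio_lo):
        return F.interpolate(audio_lo, scale_factor=2)

class MonoToStereo(nn.Module):
    """Stage 3: predict a side channel, keep the input as the mid channel."""
    def forward(self, mono):
        side = torch.zeros_like(mono)            # placeholder side-channel prediction
        left, right = mono + side, mono - side
        return torch.cat([left, right], dim=1)   # (L + R) / 2 == mono by construction

def musichifi_like(mel):
    """Cascade sketch: mel -> mono low-rate -> mono high-rate -> stereo."""
    return MonoToStereo()(BandwidthExtension()(MelToAudio()(mel)))

print(musichifi_like(torch.randn(1, 80, 128)).shape)   # torch.Size([1, 2, 65536])
```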
https://arxiv.org/abs/2403.10493
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method that not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models obtained via our Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers between branches of different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
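The multi-objective selection step can be illustrated with a small Pareto-frontier filter over (latency, mIoU) pairs; the architecture names and numbers below are made up for the example.

```python
def pareto_frontier(candidates):
    """Keep architectures that no other candidate beats on both latency (lower) and mIoU (higher)."""
    frontier = []
    for name, latency, miou in candidates:
        dominated = any(
            l <= latency and m >= miou and (l, m) != (latency, miou)
            for _, l, m in candidates
        )
        if not dominated:
            frontier.append((name, latency, miou))
    return sorted(frontier, key=lambda x: x[1])

# Hypothetical (architecture, latency in ms, mIoU) measurements sampled from a supernet.
archs = [("A", 18.2, 75.1), ("B", 25.7, 76.0), ("C", 14.9, 73.8), ("D", 26.3, 75.4)]
print(pareto_frontier(archs))   # D is dominated by B; C, A, B remain on the frontier
```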
https://arxiv.org/abs/2403.10413
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of the Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
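A minimal sketch of the region-wise idea, assuming per-task feature maps and a SAM-style binary region mask: each region is summarized as a diagonal Gaussian (per-channel mean and variance) and two tasks are aligned on that region with a symmetric KL divergence. The shapes and the choice of divergence here are illustrative, not the paper's exact formulation.

```python
import torch

def region_gaussian(features, mask):
    """features: (C, H, W) task features; mask: (H, W) boolean region (e.g. from SAM)."""
    region = features[:, mask]                       # (C, num_pixels_in_region)
    return region.mean(dim=1), region.var(dim=1) + 1e-6

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians, summed over channels."""
    return 0.5 * (torch.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1).sum()

# Features of the same image from two task branches (e.g. segmentation and depth).
feat_seg, feat_depth = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
mask = torch.zeros(32, 32, dtype=torch.bool)
mask[8:20, 8:20] = True                              # one SAM-detected region

mu_s, var_s = region_gaussian(feat_seg, mask)
mu_d, var_d = region_gaussian(feat_depth, mask)
align_loss = 0.5 * (gaussian_kl(mu_s, var_s, mu_d, var_d) + gaussian_kl(mu_d, var_d, mu_s, var_s))
```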
https://arxiv.org/abs/2403.10252
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models of up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
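The data-mixture point can be made concrete with a toy sampler that draws batches from the three source types according to fixed ratios; the ratios and loader stubs below are placeholders, not the recipe ablated in the paper.

```python
import random

def sample_batch(rng, loaders, weights):
    """Pick a data source according to the mixture weights, then draw a batch from it."""
    name = rng.choices(list(loaders), weights=weights, k=1)[0]
    return name, loaders[name]()

loaders = {
    "image_caption": lambda: "batch of (image, caption) pairs",
    "interleaved_image_text": lambda: "batch of interleaved image-text documents",
    "text_only": lambda: "batch of plain text",
}
rng = random.Random(0)
counts = {name: 0 for name in loaders}
for _ in range(1000):
    name, _ = sample_batch(rng, loaders, weights=[0.45, 0.45, 0.10])
    counts[name] += 1
print(counts)   # roughly follows the chosen mixture ratios
```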
https://arxiv.org/abs/2403.09611
Multimodal large language models (MLLMs) have shown impressive reasoning abilities; however, they are also more vulnerable to jailbreak attacks than their LLM predecessors. Although these models remain capable of detecting unsafe responses, we observe that the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed due to the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe for LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for MLLM alignment without extra human intervention.
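A training-free wrapper in the spirit of ECSO can be sketched as below; `mllm(image=..., prompt=...)` is a hypothetical chat interface standing in for any of the evaluated models, and the self-check and captioning prompts are illustrative rather than the paper's exact wording.

```python
def ecso_like_respond(mllm, image, query):
    """Sketch of an 'eyes closed' safety wrapper: answer, self-check, and if the answer looks
    unsafe, replace the image with a query-aware caption and answer text-only, so the
    pre-aligned LLM's own safety behaviour applies."""
    # 1) Answer normally with the image.
    answer = mllm(image=image, prompt=query)

    # 2) Ask the model itself whether its answer is harmful (intrinsic safety awareness).
    verdict = mllm(image=None,
                   prompt=f"Is the following response harmful? Answer yes or no.\n{answer}")
    if verdict.strip().lower().startswith("no"):
        return answer

    # 3) "Eyes closed": turn the image into a query-aware caption, then answer without the image.
    caption = mllm(image=image, prompt=f"Describe the parts of the image relevant to: {query}")
    return mllm(image=None, prompt=f"Context: {caption}\nQuestion: {query}")
```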
https://arxiv.org/abs/2403.09572
Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: this https URL.
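The local scanning idea can be sketched as a simple index permutation: instead of flattening the whole feature map row by row, tokens are visited window by window so that spatial neighbours stay close in the 1D sequence. The window size and token shapes below are illustrative.

```python
import torch

def local_scan_order(h, w, window):
    """Permutation that visits tokens window-by-window (row-major inside each window)."""
    idx = torch.arange(h * w).reshape(h, w)
    order = [idx[y:y + window, x:x + window].reshape(-1)
             for y in range(0, h, window)
             for x in range(0, w, window)]
    return torch.cat(order)

tokens = torch.randn(1, 14 * 14, 192)        # (batch, sequence, dim) after patch embedding
order = local_scan_order(14, 14, window=7)
local_tokens = tokens[:, order]              # sequence fed to the state space model
restore = torch.argsort(order)               # inverse permutation to map outputs back
assert torch.equal(local_tokens[:, restore], tokens)
```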
https://arxiv.org/abs/2403.09338
Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, to what extent they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
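Shape bias is conventionally measured on texture-shape cue-conflict images as the fraction of cue-following decisions that follow the shape cue; the toy triples below only illustrate how prompting could move that fraction and are not the paper's measurements.

```python
def shape_bias(decisions):
    """decisions: (shape_label, texture_label, prediction) triples on cue-conflict images.
    Shape bias = shape-cue hits / (shape-cue hits + texture-cue hits)."""
    shape_hits = sum(pred == shape for shape, texture, pred in decisions)
    texture_hits = sum(pred == texture for shape, texture, pred in decisions)
    return shape_hits / max(shape_hits + texture_hits, 1)

# Hypothetical predictions from the same VLM under two prompts.
neutral_prompt = [("cat", "elephant", "elephant"), ("dog", "clock", "dog"), ("car", "knife", "knife")]
shape_prompt = [("cat", "elephant", "cat"), ("dog", "clock", "dog"), ("car", "knife", "car")]
print(shape_bias(neutral_prompt), shape_bias(shape_prompt))   # ~0.33 vs 1.0
```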
https://arxiv.org/abs/2403.09193
With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) being adaptable to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available. Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model
https://arxiv.org/abs/2403.09027
Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds of 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussians to represent the image, where each Gaussian has 8 parameters including position, covariance, and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method, with at least 3$\times$ lower GPU memory usage and 5$\times$ faster fitting time, not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate an existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 1000 FPS. Additionally, a preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding.
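A minimal, unoptimized rendering sketch of the accumulated-summation idea: each 2D Gaussian carries 8 parameters (2 for position, 3 for covariance, 3 for colour) and the image is the sum of all Gaussians' colour contributions at every pixel. The Cholesky-style covariance parametrization below is an assumption made for the example, not necessarily the paper's exact choice.

```python
import torch

def render_2d_gaussians(means, cov_params, colors, H, W):
    """means: (N, 2) pixel coordinates; cov_params: (N, 3) Cholesky-style (l11, l21, l22);
    colors: (N, 3). Renders by summing each Gaussian's colour weighted by its footprint."""
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)            # (H*W, 2)

    l11, l21, l22 = cov_params[:, 0], cov_params[:, 1], cov_params[:, 2]
    # Covariance Sigma = L L^T with L = [[l11, 0], [l21, l22]]; closed-form inverse below.
    det = (l11 * l22) ** 2 + 1e-8
    a, b, c = l11 ** 2, l11 * l21, l21 ** 2 + l22 ** 2            # Sigma = [[a, b], [b, c]]
    inv = torch.stack([c, -b, -b, a], dim=-1).reshape(-1, 2, 2) / det[:, None, None]

    d = pix[None, :, :] - means[:, None, :]                       # (N, H*W, 2)
    mahal = torch.einsum("npi,nij,npj->np", d, inv, d)            # squared Mahalanobis distance
    weights = torch.exp(-0.5 * mahal)                             # (N, H*W)
    image = (weights[:, :, None] * colors[:, None, :]).sum(dim=0)  # accumulated summation
    return image.reshape(H, W, 3).clamp(0, 1)

img = render_2d_gaussians(torch.rand(100, 2) * 32, torch.rand(100, 3) * 3 + 0.5,
                          torch.rand(100, 3) * 0.2, H=32, W=32)
```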
https://arxiv.org/abs/2403.08551
The combination of language processing and image processing keeps attracting increased interest given recent impressive advances that leverage the combined strengths of both domains of research. Among these advances, the task of editing an image on the basis solely of a natural language instruction stands out as a most challenging endeavour. While recent approaches for this task resort, in one way or another, to some form of preliminary preparation, training or fine-tuning, this paper explores a novel approach: we propose a preparation-free method that permits instruction-guided image editing on the fly. The approach is organized into three properly orchestrated steps: image captioning and DDIM inversion, followed by obtaining the edit direction embedding, followed by the image editing proper. While dispensing with preliminary preparation, our approach proves effective and competitive, outperforming recent state-of-the-art models for this task when evaluated on the MAGICBRUSH dataset.
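One common way to realize an edit direction embedding is the difference between the embeddings of the target and source captions; the sketch below uses the open-source CLIP text encoder purely as an illustrative stand-in for whatever text encoder the editing pipeline relies on, and the captions are made up.

```python
import torch
import clip  # openai/CLIP, used here only as an example text encoder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def text_embedding(sentences):
    with torch.no_grad():
        feats = model.encode_text(clip.tokenize(sentences).to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

# The source caption would come from the captioning step; the target caption is the source
# caption with the instruction applied. The resulting direction can then steer the
# DDIM-inverted latents of the input image during editing.
source_caption = "a photo of a cat sitting on a sofa"
target_caption = "a photo of a dog sitting on a sofa"
edit_direction = text_embedding([target_caption]) - text_embedding([source_caption])
edit_direction = edit_direction / edit_direction.norm()
```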
https://arxiv.org/abs/2403.08004
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a ``foreign language'' with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion, and crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at this https URL.
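The core "image as a foreign language" step can be sketched as nearest-neighbour quantization of patch features against the LLM's embedding table; the encoder features and embedding matrix below are random stand-ins, and cosine similarity is an assumed choice of metric.

```python
import torch
import torch.nn.functional as F

def quantize_to_vocab(patch_features, vocab_embeddings):
    """patch_features: (num_patches, d); vocab_embeddings: (vocab_size, d).
    Maps each patch to its most similar vocabulary embedding, so the image becomes a
    sequence of ordinary token ids the frozen LLM already understands."""
    f = F.normalize(patch_features, dim=-1)
    v = F.normalize(vocab_embeddings, dim=-1)
    token_ids = (f @ v.T).argmax(dim=-1)              # (num_patches,)
    return token_ids, vocab_embeddings[token_ids]     # ids and their quantized embeddings

# Random stand-ins for an image encoder's outputs and an LLM's embedding table.
patch_features = torch.randn(64, 512)
embedding_table = torch.randn(32000, 512)
ids, quantized = quantize_to_vocab(patch_features, embedding_table)
print(ids.shape, quantized.shape)                     # torch.Size([64]) torch.Size([64, 512])
```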
https://arxiv.org/abs/2403.07874
The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method pretrains a text-to-image model to synthesize image embeddings from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that the VLM trained with synthetic data exhibits comparable performance on image captioning while requiring only a fraction of the data used by models trained solely on human-annotated data. In particular, we outperform the baseline by 17% through augmentation with a synthetic dataset. Furthermore, we show that synthesizing in the image embedding space is 25% faster than in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, all with improved data efficiency and resource utilization.
https://arxiv.org/abs/2403.07750
Medical vision language pre-training (VLP) has emerged as a frontier of research, enabling zero-shot pathological recognition by comparing the query image with the textual descriptions of each disease. Due to the complex semantics of biomedical texts, current methods struggle to align medical images with key pathological findings in unstructured reports. This leads to misalignment with the target disease's textual representation. In this paper, we introduce a novel VLP framework designed to dissect disease descriptions into their fundamental aspects, leveraging prior knowledge about the visual manifestations of pathologies. This is achieved by consulting a large language model and medical experts. Integrating a Transformer module, our approach aligns an input image with the diverse elements of a disease, generating aspect-centric image representations. By consolidating the matches from each aspect, we improve the compatibility between an image and its associated disease. Additionally, capitalizing on the aspect-oriented representations, we present a dual-head Transformer tailored to process known and unknown diseases, optimizing the comprehensive detection efficacy. In experiments on seven downstream datasets, our method outperforms recent methods by up to 8.07% and 11.23% in AUC scores for seen and novel categories, respectively. Our code is released at \href{this https URL}{this https URL}.
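A sketch of the aspect-centric matching: an image is scored against a disease by comparing it with each aspect description separately and consolidating the per-aspect similarities, rather than against one monolithic sentence. The random features and the uniform weighting are illustrative stand-ins for the learned components.

```python
import torch
import torch.nn.functional as F

def disease_score(image_feat, aspect_feats, weights=None):
    """image_feat: (d,) image embedding; aspect_feats: (num_aspects, d) embeddings of the
    disease's aspect descriptions (e.g. opacity, shape, location). Returns a consolidated score."""
    sims = F.cosine_similarity(image_feat[None, :], aspect_feats, dim=-1)
    if weights is None:
        weights = torch.full_like(sims, 1.0 / sims.numel())
    return (weights * sims).sum()

# Illustrative random embeddings for one image and two candidate diseases.
image_feat = torch.randn(512)
aspects = {"pneumonia": torch.randn(4, 512), "atelectasis": torch.randn(3, 512)}
scores = {name: disease_score(image_feat, feats).item() for name, feats in aspects.items()}
print(scores)
```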
https://arxiv.org/abs/2403.07636
Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size ($<$1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
https://arxiv.org/abs/2403.06793
Research on generative models that produce human-aligned / human-preferred outputs has seen significant recent contributions. Between text- and image-generative models, we narrowed our focus to text-based generative models, particularly to producing captions for images that align with human preferences. In this research, we explored a potential method to amplify the performance of a deep neural network model to generate captions that are preferred by humans. This was achieved by integrating supervised learning and Reinforcement Learning with Human Feedback (RLHF) using the Flickr8k dataset. Also, a novel loss function that is capable of optimizing the model based on human feedback is introduced. In this paper, we provide a concise sketch of our approach and results, hoping to contribute to the ongoing advances in the field of human-aligned generative AI models.
https://arxiv.org/abs/2403.06735
In several real-world scenarios, such as autonomous navigation and mobility, image captioning and object detection play a crucial role in obtaining a better visual understanding of the surroundings. This work introduces a novel multitask learning framework that combines image captioning and object detection into a joint model. We propose TICOD, a Transformer-based Image Captioning and Object Detection model, for jointly training both tasks by combining the losses obtained from the image captioning and object detection networks. By leveraging joint training, the model benefits from the complementary information shared between the two tasks, leading to improved performance for image captioning. Our approach utilizes a transformer-based architecture that enables end-to-end network integration for image captioning and object detection and performs both tasks jointly. We evaluate the effectiveness of our approach through comprehensive experiments on the MS-COCO dataset. Our model outperforms the baselines from the image captioning literature by achieving a 3.65% improvement in BERTScore.
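The joint objective can be sketched as a weighted sum of the captioning loss and the detection loss; the tensors and the weighting factor below are placeholders for the two heads' actual outputs.

```python
import torch
import torch.nn.functional as F

def joint_multitask_loss(caption_logits, caption_targets, detection_loss, lam=1.0):
    """Token-level cross-entropy for the captioning head plus the detection head's loss,
    traded off by lam."""
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)), caption_targets.reshape(-1))
    return caption_loss + lam * detection_loss

# Illustrative tensors: 2 captions of length 12 over a 1000-word vocabulary, plus a stand-in
# scalar detection loss (in practice the sum of classification and box regression terms).
logits = torch.randn(2, 12, 1000, requires_grad=True)
targets = torch.randint(0, 1000, (2, 12))
loss = joint_multitask_loss(logits, targets, detection_loss=torch.tensor(0.7))
loss.backward()
```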
https://arxiv.org/abs/2403.06292
While various deep learning methods were proposed for low-dose computed tomography (CT) denoising, they often suffer from over-smoothing, blurring, and lack of explainability. To alleviate these issues, we propose a plug-and-play Language-Engaged Dual-space Alignment loss (LEDA) to optimize low-dose CT denoising models. Our idea is to leverage large language models (LLMs) to align denoised CT and normal dose CT images in both the continuous perceptual space and discrete semantic space, which is the first LLM-based scheme for low-dose CT denoising. LEDA involves two steps: the first is to pretrain an LLM-guided CT autoencoder, which can encode a CT image into continuous high-level features and quantize them into a token space to produce semantic tokens derived from the LLM's vocabulary; and the second is to minimize the discrepancy between the denoised CT images and normal dose CT in terms of both encoded high-level features and quantized token embeddings derived by the LLM-guided CT autoencoder. Extensive experimental results on two public LDCT denoising datasets demonstrate that our LEDA can enhance existing denoising models in terms of quantitative metrics and qualitative evaluation, and also provide explainability through language-level image understanding. Source code is available at this https URL.
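The dual-space alignment can be sketched as two terms: an MSE between the continuous high-level features of the denoised and normal-dose images, plus an alignment of their quantized token embeddings from a codebook tied to the LLM vocabulary. The feature shapes, the codebook, and the omitted straight-through handling of quantization are all placeholders here.

```python
import torch
import torch.nn.functional as F

def dual_space_alignment(feat_denoised, feat_ndct, codebook, lam=1.0):
    """feat_*: (batch, d) high-level features from the (frozen) LLM-guided CT autoencoder;
    codebook: (K, d) token embeddings derived from the LLM vocabulary."""
    perceptual = F.mse_loss(feat_denoised, feat_ndct)             # continuous perceptual space

    def quantize(f):                                              # nearest-codebook lookup
        idx = torch.cdist(f, codebook).argmin(dim=-1)             # (straight-through omitted)
        return codebook[idx]

    semantic = F.mse_loss(quantize(feat_denoised), quantize(feat_ndct))   # discrete semantic space
    return perceptual + lam * semantic

loss = dual_space_alignment(torch.randn(4, 256), torch.randn(4, 256), torch.randn(1024, 256))
```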
https://arxiv.org/abs/2403.06128
We show that simply initializing image understanding models with a pre-trained UNet (or transformer) from diffusion models makes it possible to achieve remarkable transferable performance on fundamental vision perception tasks, including monocular depth, surface normal, image segmentation, matting, and human pose estimation, among many others, using a moderate amount of target data (even synthetic data only). Previous works have adapted diffusion models for various perception tasks, often reformulating these tasks as generation processes to align with the diffusion process. In sharp contrast, we demonstrate that fine-tuning these models with minimal adjustments can be a more effective alternative, offering the advantages of being embarrassingly simple and significantly faster. As the backbone network of Stable Diffusion models is trained on giant datasets comprising billions of images, we observe very robust generalization capabilities of the diffusion backbone. Experimental results showcase the remarkable transferability of the diffusion models' backbone across diverse tasks and real-world datasets.
https://arxiv.org/abs/2403.06090
In this paper, we propose a new framework for improving Content-Based Image Retrieval (CBIR) for texture images. This is achieved by using a new image representation based on the RCT-Plus transform, a novel variant of the Redundant Contourlet transform that extracts richer directional information from the image. Moreover, the image search process is improved through a learning-based approach in which the images of the database are classified using a similarity metric adapted to the statistical modeling of the RCT-Plus transform. A query is first classified to select the best texture class, after which the retained class images are ranked to select the top ones. With this, we achieve significant improvements in retrieval rates compared to previous CBIR schemes.
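The classify-then-rank search can be sketched as below; the nearest-centroid classifier and Euclidean distance are placeholder stand-ins for the statistical modeling of RCT-Plus coefficients and its adapted similarity metric.

```python
import numpy as np

def retrieve(query_feat, database, classify, distance, top_k=10):
    """database: list of (image_id, class_label, feature). First select the query's texture
    class, then rank only that class's images instead of scanning the whole database."""
    q_class = classify(query_feat)
    candidates = [(img_id, distance(query_feat, feat))
                  for img_id, label, feat in database if label == q_class]
    return sorted(candidates, key=lambda pair: pair[1])[:top_k]

# Placeholder stand-ins: random features, a nearest-centroid classifier, Euclidean distance.
rng = np.random.default_rng(0)
centroids = {c: rng.normal(size=32) for c in ["bark", "fabric", "brick"]}
classify = lambda f: min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))
distance = lambda a, b: float(np.linalg.norm(a - b))
database = [(i, str(rng.choice(["bark", "fabric", "brick"])), rng.normal(size=32)) for i in range(100)]
print(retrieve(rng.normal(size=32), database, classify, distance, top_k=5))
```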
https://arxiv.org/abs/2403.06048