In many scenarios, especially biomedical applications, the correct delineation of complex fine-scaled structures such as neurons, tissues, and vessels is critical for downstream analysis. Despite the strong predictive power of deep learning methods, they do not provide a satisfactory representation of these structures, thus creating significant barriers in scalable annotation and downstream analysis. In this dissertation, we tackle such challenges by proposing novel representations of these topological structures in a deep learning framework. We leverage the mathematical tools from topological data analysis, i.e., persistent homology and discrete Morse theory, to develop principled methods for better segmentation and uncertainty estimation, which will become powerful tools for scalable annotation.
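Not the dissertation's actual pipeline, just a toy illustration of the persistent-homology machinery it builds on: the sketch below computes 0-dimensional sublevel-set persistence of a sampled 1D function with a union-find sweep; long bars correspond to salient components, short bars to noise.

import numpy as np

def persistence_0d(values):
    # 0-dimensional sublevel-set persistence of a sampled 1D function (elder rule).
    # Returns (birth, death) pairs; the surviving component gets death = inf.
    order = np.argsort(values)               # sweep sample points from low to high value
    parent, birth, pairs = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i], birth[i] = i, values[i]    # a component is (potentially) born here
        for j in (i - 1, i + 1):              # merge with already-activated neighbours
            if j in parent:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:  # skip zero-persistence pairs
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    pairs.extend((birth[r], np.inf) for r in {find(i) for i in parent})
    return pairs

print(persistence_0d(np.array([3.0, 1.0, 2.0, 0.0, 4.0])))  # two bars: (1.0, 2.0) and (0.0, inf)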
https://arxiv.org/abs/2403.15361
The year 2023 marked a significant surge in the exploration of applying large language model (LLM) chatbots, notably ChatGPT, across various disciplines. We surveyed the applications of ChatGPT in various sectors of bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future development.
https://arxiv.org/abs/2403.15274
There has been significant progress in text-conditional image generation models. Recent advances in this field depend not only on improvements in model structure but also on vast quantities of paired text-image data. However, creating such datasets is very costly and requires a substantial amount of labor. Well-known face datasets lack corresponding text captions, making it difficult to develop text-conditional image generation models on them. Some research has therefore focused on developing text-to-image generation models using only images, without text captions. Here, we propose CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, our model outperformed previous state-of-the-art methods by 4.4% in CLIPScore and generated very realistic images for text both in and out of distribution. The pretrained models and code will soon be available at this https URL
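The CLIPScore used for evaluation above is, at its core, a scaled cosine similarity between CLIP's image and text embeddings (Hessel et al. define it as w * max(cos, 0) with w = 2.5; the paper's exact reporting protocol may differ). A minimal sketch with the open-source clip package:

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path, caption, w=2.5):
    # CLIPScore-style value: w * max(cos(image_emb, text_emb), 0)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    cos = torch.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)

# e.g. clip_score("face.png", "a smiling woman with short red hair")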
https://arxiv.org/abs/2403.14944
Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.
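One plausible minimal form of the external "concept heads" described above is a small binary classifier over frozen image features that gates whether a learned concept embedding is injected into the VLM's prompt. Everything here (feature dimensions, threshold, the prompt-prepending scheme) is an illustrative assumption, not the paper's exact architecture.

import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    # Binary "toggle": predicts whether a user-specific concept appears in the image.
    def __init__(self, feat_dim=768):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, image_features):                 # (B, feat_dim) pooled features
        return torch.sigmoid(self.classifier(image_features)).squeeze(-1)

class PersonalizedPrompt(nn.Module):
    # If the head fires, prepend a learned concept embedding to the LLM's soft prompt.
    def __init__(self, feat_dim=768, llm_dim=4096, threshold=0.5):
        super().__init__()
        self.head = ConceptHead(feat_dim)
        self.concept_embedding = nn.Parameter(torch.randn(1, llm_dim) * 0.02)
        self.threshold = threshold

    def forward(self, image_features, prompt_embeds):  # prompt_embeds: (B, T, llm_dim)
        present = self.head(image_features) > self.threshold       # (B,)
        concept = self.concept_embedding.expand(prompt_embeds.size(0), 1, -1)
        concept = concept * present.view(-1, 1, 1)                  # zero out if concept absent
        return torch.cat([concept, prompt_embeds], dim=1)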
https://arxiv.org/abs/2403.14599
Multi-modal Large Language Models (MLLMs) demonstrate remarkable success across various vision-language tasks. However, they suffer from visual hallucination, where the generated responses diverge from the provided image. Are MLLMs completely oblivious to accurate visual cues when they hallucinate? Our investigation reveals that the visual branch may simultaneously advocate both accurate and non-existent content. To address this issue, we propose Pensieve, a training-free method inspired by our observation that analogous visual hallucinations can arise among images sharing common semantic and appearance characteristics. During inference, Pensieve enables MLLMs to retrospect relevant images as references and compare them with the test image. This paradigm assists MLLMs in downgrading hallucinatory content mistakenly supported by the visual input. Experiments on Whoops, MME, POPE, and LLaVA Bench demonstrate the efficacy of Pensieve in mitigating visual hallucination, surpassing other advanced decoding strategies. Additionally, Pensieve aids MLLMs in identifying details in the image and enhancing the specificity of image descriptions.
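Pensieve is training-free and adjusts decoding by comparing the test image against retrieved reference images; the paper specifies the exact scoring rule, but the general flavour of contrasting test-image logits against reference-image logits can be sketched as follows (the function, weighting, and toy numbers are illustrative assumptions, not the paper's formula):

import torch

def retrospective_logits(logits_test, logits_refs, alpha=1.0):
    # logits_test: (V,) next-token logits conditioned on the test image
    # logits_refs: (K, V) logits conditioned on K retrieved reference images
    # Tokens equally supported by every similar image are likely generic or
    # hallucinatory; tokens specific to the test image are boosted.
    ref_mean = logits_refs.mean(dim=0)
    return logits_test + alpha * (logits_test - ref_mean)

# toy usage
logits_test = torch.tensor([2.0, 0.5, 1.0])
logits_refs = torch.tensor([[2.0, 1.5, 0.0], [1.8, 1.4, 0.2]])
print(retrospective_logits(logits_test, logits_refs))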
https://arxiv.org/abs/2403.14401
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have gained significant attention due to their importance. Existing methods enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose LayoutLLM, which integrates these document understanding capabilities with large-scale language models (LLMs). By leveraging the strengths of existing research in document image understanding and LLMs' superior language understanding capabilities, the proposed model, fine-tuned on multimodal instruction datasets, performs document image understanding in a single model. Our experiments demonstrate improvements over the baseline model in various document analysis tasks.
https://arxiv.org/abs/2403.14252
We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: they act as prompts for synthesizing images through a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we leverage an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model. Particularly, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, and on RefCOCO+ Test A from 69.35% to 79.06% and on RefCOCO+ Test B from 53.77% to 63.67%.
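The data-synthesis pipeline above chains several off-the-shelf models. A high-level orchestration sketch follows; every component call is a hypothetical placeholder for whichever captioner, text-to-image generator, LLM, and open-vocabulary detector one plugs in, and the prompt wording is illustrative.

def synthesize_grounding_data(exemplar_images, captioner, text2image, llm, ov_detector):
    # captioner(image) -> str                  image description generator
    # text2image(prompt) -> image              text-to-image generator
    # llm(query) -> list[str]                  phrase extraction from the description
    # ov_detector(image, phrases) -> list[(phrase, box)]   open-vocabulary detector
    dataset = []
    for image in exemplar_images:
        description = captioner(image)                # knowledge transfer starts here
        synthetic_image = text2image(description)     # description reused as a prompt
        phrases = llm(f"List the object phrases in: {description}")
        for phrase, box in ov_detector(synthetic_image, phrases):
            dataset.append({"image": synthetic_image, "phrase": phrase, "box": box})
    return dataset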
https://arxiv.org/abs/2403.13804
The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with "perceptual backbones" that process, e.g., visual or audio data, they are often studied on different tasks, different datasets, and with different perceptual backbones and language models, hindering direct comparison of the interfacing mechanisms. To remedy this lack of comparability between methods, we present an extensive experimental evaluation of different interfacing mechanisms across multiple tasks (including image, video, and audio captioning as well as visual question answering), datasets, and backbones, paying special attention to low-data settings. We find that existing mechanisms can improve on state-of-the-art results, and we identify a new interfacing mechanism that yields (near) optimal results across different tasks while obtaining a 4x reduction in training time.
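One family of interfacing mechanisms compared above maps perceptual features into the LLM's embedding space. A minimal linear-projection variant is sketched below; the dimensions, token count, and module names are illustrative, and the paper compares several richer mechanisms.

import torch
import torch.nn as nn

class LinearInterface(nn.Module):
    # Project frozen perceptual-backbone features into the LLM token-embedding space
    # and prepend them to the text embeddings as "visual tokens".
    def __init__(self, vision_dim=1024, llm_dim=2048, num_visual_tokens=32):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)
        self.num_visual_tokens = num_visual_tokens

    def forward(self, vision_features, text_embeds):
        # vision_features: (B, N, vision_dim) patch features from the perceptual backbone
        # text_embeds:     (B, T, llm_dim) embedded prompt tokens
        visual_tokens = self.proj(vision_features[:, : self.num_visual_tokens])
        return torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the (frozen) LLM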
https://arxiv.org/abs/2403.13499
Recent trends in Large Vision Language Models (LVLMs) research have been increasingly focusing on advancing beyond general image understanding towards more nuanced, object-level referential comprehension. In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process. This capability significantly mirrors the precision and reliability of fine-grained visual-language understanding. Our findings reveal that the self-consistency level of existing LVLMs falls short of expectations, posing limitations on their practical applicability and potential. To address this gap, we introduce a novel fine-tuning paradigm named Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a cyclic describer-locator system. This paradigm is not only data-efficient but also exhibits generalizability across multiple LVLMs. Through extensive experiments, we demonstrate that SC-Tune significantly elevates performance across a spectrum of object-level vision-language benchmarks and maintains competitive or improved performance on image-level vision-language benchmarks. Both our model and code will be publicly available at this https URL.
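SC-Tune's cyclic describer-locator idea can be pictured as a self-consistency reward: the model captions a region, then must re-locate that region from its own caption. A schematic sketch is below; the describe/locate calls stand in for the LVLM's referential captioning and grounding interfaces, and the IoU reward is an illustrative choice rather than the paper's exact objective.

def iou(box_a, box_b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def self_consistency_reward(model, image, gt_box):
    # Describe a region, re-locate it from the description, reward closing the loop.
    caption = model.describe(image, gt_box)     # describer: region -> caption
    pred_box = model.locate(image, caption)     # locator: caption -> region
    return iou(pred_box, gt_box)                # high only if the caption is re-identifiable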
https://arxiv.org/abs/2403.13263
Foundation models pre-trained on web-scale vision-language data, such as CLIP, are widely used as cornerstones of powerful machine learning systems. While pre-training offers clear advantages for downstream learning, it also endows downstream models with shared adversarial vulnerabilities that can be easily identified through the open-sourced foundation model. In this work, we expose such vulnerabilities in CLIP's downstream models and show that foundation models can serve as a basis for attacking their downstream systems. In particular, we propose a simple yet effective adversarial attack strategy termed Patch Representation Misalignment (PRM). Solely based on open-sourced CLIP vision encoders, this method produces adversaries that simultaneously fool more than 20 downstream models spanning 4 common vision-language tasks (semantic segmentation, object detection, image captioning and visual question-answering). Our findings highlight the concerning safety risks introduced by the extensive usage of public foundational models in the development of downstream systems, calling for extra caution in these scenarios.
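The core of a patch-representation attack of the kind described above can be sketched as a PGD loop that pushes the CLIP vision encoder's patch tokens away from those of the clean image. The feature-extraction call and loss choice below are illustrative assumptions, not necessarily PRM's exact objective.

import torch

def patch_misalignment_attack(encoder, image, eps=8 / 255, alpha=1 / 255, steps=10):
    # encoder(image) is assumed to return per-patch features of shape (B, N, D)
    # from an open-sourced CLIP vision encoder; image values lie in [0, 1].
    with torch.no_grad():
        clean_patches = encoder(image)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_patches = encoder((image + delta).clamp(0, 1))
        misalignment = -torch.cosine_similarity(adv_patches, clean_patches, dim=-1).mean()
        misalignment.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the misalignment
            delta.clamp_(-eps, eps)              # L_inf budget
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()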
https://arxiv.org/abs/2403.12693
Vision-language pre-training (VLP) models exhibit remarkable capabilities in comprehending both images and text, yet they remain susceptible to multimodal adversarial examples (AEs). Strengthening adversarial attacks and uncovering vulnerabilities, especially common issues in VLP models (e.g., highly transferable AEs), can stimulate further research on constructing reliable and practical VLP models. A recent work (the Set-level Guidance Attack) indicates that augmenting image-text pairs to increase AE diversity along the optimization path significantly enhances the transferability of adversarial examples. However, this approach predominantly emphasizes diversity around the online adversarial examples (i.e., AEs in the optimization period), which risks overfitting the victim model and limiting transferability. In this study, we posit that the diversity of adversarial examples around both the clean input and the online AEs is pivotal for enhancing transferability across VLP models. Consequently, we propose diversification along the intersection region of the adversarial trajectory to expand the diversity of AEs. To fully leverage the interaction between modalities, we introduce text-guided adversarial example selection during optimization. Furthermore, to further mitigate potential overfitting, we steer the adversarial text away from the last intersection region along the optimization path, rather than the adversarial images as in existing methods. Extensive experiments affirm the effectiveness of our method in improving transferability across various VLP models and downstream vision-and-language tasks (e.g., Image-Text Retrieval (ITR), Visual Grounding (VG), and Image Captioning (IC)).
https://arxiv.org/abs/2403.12445
Open-domain real-world entity recognition is essential yet challenging, involving identifying various entities in diverse environments. The lack of a suitable evaluation dataset has been a major obstacle in this field due to the vast number of entities and the extensive human effort required for data curation. We introduce Entity6K, a comprehensive dataset for real-world entity recognition, featuring 5,700 entities across 26 categories, each supported by 5 human-verified images with annotations. Entity6K offers a diverse range of entity names and categorizations, addressing a gap in existing datasets. We conducted benchmarks with existing models on tasks like image captioning, object detection, zero-shot classification, and dense captioning to demonstrate Entity6K's effectiveness in evaluating models' entity recognition capabilities. We believe Entity6K will be a valuable resource for advancing accurate entity recognition in open-domain settings.
https://arxiv.org/abs/2403.12339
Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by our method achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at: this https URL.
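The selection criterion above can be pictured concretely: keep a subset whose image-caption cross-covariance matrix stays close to that of the full dataset. A greedy toy sketch on precomputed embeddings follows; the paper's actual algorithm and guarantees are more involved, and the random data here is purely for illustration.

import numpy as np

def cross_cov(img_emb, txt_emb):
    # Cross-covariance between (centered) image and caption embeddings.
    img_c = img_emb - img_emb.mean(axis=0)
    txt_c = txt_emb - txt_emb.mean(axis=0)
    return img_c.T @ txt_c / len(img_emb)

def greedy_select(img_emb, txt_emb, k):
    # Greedily pick k pairs whose cross-covariance best matches the full data's.
    target = cross_cov(img_emb, txt_emb)
    selected, remaining = [], list(range(len(img_emb)))
    for _ in range(k):
        best, best_err = None, np.inf
        for i in remaining:
            trial = selected + [i]
            err = np.linalg.norm(cross_cov(img_emb[trial], txt_emb[trial]) - target)
            if err < best_err:
                best, best_err = i, err
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
imgs, txts = rng.normal(size=(100, 16)), rng.normal(size=(100, 16))
print(greedy_select(imgs, txts, k=5))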
https://arxiv.org/abs/2403.12267
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically. Different from the image captioning task, visual storytelling requires not only modeling the relationships between objects in the image but also mining the connections between adjacent images. Recent approaches primarily utilize either end-to-end frameworks or multi-stage frameworks to generate relevant stories, but they usually overlook latent topic information. In this paper, in order to generate a more coherent and relevant story, we propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST). In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives. Then we apply two topic-consistent reinforcement learning rewards to identify the discrepancy between the generated story and the human-labeled story so as to refine the whole generation process. Extensive experimental results on the VIST dataset and human evaluation demonstrate that our proposed model outperforms most of the competitive models across multiple evaluation metrics.
https://arxiv.org/abs/2403.11550
Two approaches have emerged for providing image input to large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. Most recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches, but overlooks a direct comparison between them. We design a controlled and focused experiment to compare these two approaches for few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B-parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find that textual image captions are better. In the few-shot regimes, how the in-context examples are selected determines which approach is better.
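The caption-based interface in the comparison above reduces VQA to pure text prompting. A sketch of how such a few-shot prompt might be assembled for a text-only LLM such as Flan-T5 follows; the template and the captioner call are illustrative assumptions, not the paper's exact setup.

def build_caption_vqa_prompt(captioner, support_set, test_image, test_question):
    # captioner(image) -> str; support_set is a list of (image, question, answer)
    # in-context examples, whose selection strategy matters per the findings above.
    blocks = []
    for image, question, answer in support_set:
        blocks.append(
            f"Image description: {captioner(image)}\n"
            f"Question: {question}\nAnswer: {answer}"
        )
    blocks.append(
        f"Image description: {captioner(test_image)}\n"
        f"Question: {test_question}\nAnswer:"
    )
    return "\n\n".join(blocks)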
https://arxiv.org/abs/2403.11317
Image-text retrieval (ITR) plays a significant role in making informed decisions for various remote sensing (RS) applications. Nonetheless, creating ITR datasets that contain vision and language modalities requires not only a large geo-spatial sampling area but also varied categories and detailed descriptions. To this end, we introduce LuojiaHOG, an image caption dataset that is geospatially aware, label-extension-friendly, and comprehensively captioned. LuojiaHOG involves hierarchical spatial sampling, an extensible classification system aligned with Open Geospatial Consortium (OGC) standards, and detailed caption generation. In addition, we propose a CLIP-based Image Semantic Enhancement Network (CISEN) to promote sophisticated ITR. CISEN consists of two components, namely dual-path knowledge transfer and progressive cross-modal feature fusion. Comprehensive statistics on LuojiaHOG reveal its richness in sampling diversity, label quantity, and description granularity. The evaluation on LuojiaHOG is conducted across various state-of-the-art ITR models, including ALBEF, ALIGN, CLIP, FILIP, Wukong, GeoRSCLIP, and CISEN. We use second- and third-level labels to evaluate these vision-language models through adapter tuning, and CISEN demonstrates superior performance. For instance, it achieves the highest scores, with WMAP@5 of 88.47% and 87.28% on third-level ITR tasks. In particular, CISEN exhibits an improvement of approximately 1.3% and 0.9% in WMAP@5 compared to its baseline. These findings highlight CISEN's advances in accurately retrieving pertinent information across images and text. LuojiaHOG and CISEN can serve as a foundational resource for future RS image-text alignment research, facilitating a wide range of vision-language applications.
https://arxiv.org/abs/2403.10887
Pre-training image representations from raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that often reach competitiveness with fully supervised methods without the need for task-specific training. Besides the encouraging performance on classification accuracy, it has been reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural and 3 synthetic distribution shifts, and 11 adversarial attacks. We use CLIP as a pilot study. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our results show that a comprehensive evaluation of robustness is necessary and that there is a significant need to improve the robustness of zero-shot multimodal models.
https://arxiv.org/abs/2403.10499
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
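The "downmix-compatible" property mentioned above has a simple realization: predict only a side signal and form left/right as mid plus/minus side, so averaging the two channels recovers the original mono exactly. A sketch of that final upmixing stage follows; the side-signal predictor stands in for the paper's learned upmixer.

import numpy as np

def upmix_mono_to_stereo(mono, predict_side):
    # mono:         (T,) mono waveform (the "mid" signal)
    # predict_side: callable mapping a mono waveform to a (T,) "side" signal,
    #               standing in for the learned mono-to-stereo upmixer.
    side = predict_side(mono)
    left, right = mono + side, mono - side
    return np.stack([left, right])     # (2, T); (left + right) / 2 recovers mono

# toy check with a dummy side predictor
mono = np.sin(np.linspace(0, 2 * np.pi, 16000))
stereo = upmix_mono_to_stereo(mono, predict_side=lambda m: 0.1 * np.roll(m, 8))
assert np.allclose(stereo.mean(axis=0), mono)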
https://arxiv.org/abs/2403.10493
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models obtained via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers between branches at different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at this https URL.
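The multi-objective search above selects architectures on the Pareto frontier of, e.g., latency (lower is better) and mIoU (higher is better). A small helper that extracts that frontier from candidate measurements is sketched below; it is illustrative only, since HyCTAS's actual search operates inside a supernet, and the example numbers are made up.

def pareto_frontier(candidates):
    # candidates: list of dicts like {"name": ..., "latency_ms": ..., "miou": ...}.
    # A dominates B if A is no slower and no less accurate, and strictly better in one.
    def dominates(a, b):
        return (a["latency_ms"] <= b["latency_ms"] and a["miou"] >= b["miou"]
                and (a["latency_ms"] < b["latency_ms"] or a["miou"] > b["miou"]))

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

archs = [
    {"name": "A", "latency_ms": 12.0, "miou": 78.1},
    {"name": "B", "latency_ms": 20.0, "miou": 80.4},
    {"name": "C", "latency_ms": 25.0, "miou": 79.0},   # dominated by B
]
print([a["name"] for a in pareto_frontier(archs)])      # ['A', 'B']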
https://arxiv.org/abs/2403.10413
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of the Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal models region-wise representations using Gaussian distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to capture cross-task relationships effectively, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
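Region-wise Gaussian modeling as described above admits a simple closed-form alignment term: with a diagonal-covariance simplification, the squared 2-Wasserstein distance between two regional Gaussians is the squared distance between means plus the squared distance between standard deviations. The sketch below uses that simplification over a SAM-style region mask; it is an illustrative loss, not the paper's exact objective, and the tensors are toy data.

import torch

def region_gaussian(features, mask, eps=1e-6):
    # Mean and variance of per-pixel features inside one region mask.
    # features: (C, H, W) task-specific feature map; mask: (H, W) boolean region mask.
    region = features[:, mask]                    # (C, P) pixels inside the region
    return region.mean(dim=1), region.var(dim=1) + eps

def gaussian_alignment_loss(feat_task_a, feat_task_b, mask):
    # Squared 2-Wasserstein distance between diagonal Gaussians of two tasks' features.
    mu_a, var_a = region_gaussian(feat_task_a, mask)
    mu_b, var_b = region_gaussian(feat_task_b, mask)
    return ((mu_a - mu_b) ** 2).sum() + ((var_a.sqrt() - var_b.sqrt()) ** 2).sum()

# toy usage
feat_a, feat_b = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
mask = torch.zeros(32, 32, dtype=torch.bool)
mask[8:20, 8:20] = True
print(gaussian_alignment_loss(feat_a, feat_b, mask))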
https://arxiv.org/abs/2403.10252