Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy while using only 8.4 and 8.2 frames on average. These results demonstrate the superior effectiveness and efficiency of our method over current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
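A minimal sketch of the iterative LLM-as-agent loop the abstract describes, assuming hypothetical helpers (`uniform_sample`, `caption_frame`, `format_state`, `retrieve_frames`) and a generic `llm` client; it illustrates the paradigm, not the authors' implementation.

```python
def video_agent_answer(video, question, llm, max_rounds=5, n_init=5):
    """Iterative LLM-as-agent loop for long-form video QA (all helpers are hypothetical)."""
    frames = uniform_sample(video, n_init)           # hypothetical frame sampler
    captions = [caption_frame(f) for f in frames]    # hypothetical VLM captioning tool

    for _ in range(max_rounds):
        state = format_state(question, captions)     # hypothetical prompt builder
        decision = llm.ask(state + "\nIf you can answer confidently, reply 'ANSWER: <answer>'. "
                                   "Otherwise reply 'NEED: <missing information>'.")
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        # The LLM asked for more evidence: retrieve frames relevant to the missing information.
        query = decision.removeprefix("NEED:").strip()
        new_frames = retrieve_frames(video, query, k=3)   # hypothetical CLIP-style retrieval
        captions += [caption_frame(f) for f in new_frames]

    # Budget exhausted: answer with whatever evidence has been gathered.
    return llm.ask(format_state(question, captions) + "\nGive your best answer.")
```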
https://arxiv.org/abs/2403.10517
Pre-training image representations from raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that often reach competitiveness with fully supervised methods without the need for task-specific training. Besides the encouraging classification accuracy, it has been reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural and 3 synthetic distribution shifts and 11 adversarial attacks, using CLIP as a pilot study. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our evaluation shows that a comprehensive evaluation of robustness is necessary, and that there is a significant need to improve the robustness of zero-shot multimodal models.
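A minimal zero-shot CLIP evaluation loop of the kind such a robustness benchmark builds on, using OpenAI's `clip` package; the data loader and prompt template are placeholders, and the actual benchmark repeats this over many shifted test sets and attacks.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def zero_shot_accuracy(loader, class_names, template="a photo of a {}"):
    """Zero-shot accuracy of CLIP on one (possibly distribution-shifted) test set."""
    with torch.no_grad():
        text = clip.tokenize([template.format(c) for c in class_names]).to(device)
        text_feat = model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        correct = total = 0
        for images, labels in loader:          # images already passed through `preprocess`
            img_feat = model.encode_image(images.to(device))
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            pred = (img_feat @ text_feat.T).argmax(dim=-1).cpu()
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / total

# The benchmark runs this on ImageNet and its shifted variants (e.g., ImageNet-V2, -R, -Sketch)
# and compares the resulting accuracies against supervised ImageNet models.
```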
https://arxiv.org/abs/2403.10499
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
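A structural sketch of the three-stage cascade (mel-to-audio vocoder, bandwidth extension, mono-to-stereo upmix) in PyTorch. The module internals are placeholders, and the mid/side construction is one way to realize the downmix-compatible constraint, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MusicHiFiCascade(nn.Module):
    """Illustrative three-stage cascade: mel -> low-rate mono -> full-band mono -> stereo."""
    def __init__(self, vocoder: nn.Module, bwe: nn.Module, upmixer: nn.Module):
        super().__init__()
        self.vocoder = vocoder    # GAN generator: mel-spectrogram -> low-rate mono audio
        self.bwe = bwe            # GAN generator: low-rate mono -> full-band mono audio
        self.upmixer = upmixer    # GAN generator: mono -> predicted side channel

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        mono_lr = self.vocoder(mel)              # (B, 1, T_low)
        mono_hr = self.bwe(mono_lr)              # (B, 1, T_high)
        side = self.upmixer(mono_hr)             # (B, 1, T_high)
        # Mid/side decoding: the downmix (left + right) / 2 recovers the mono input,
        # which keeps the monophonic content intact in the stereo output.
        left = mono_hr + side
        right = mono_hr - side
        return torch.cat([left, right], dim=1)   # (B, 2, T_high)
```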
https://arxiv.org/abs/2403.10493
Mitigating hallucinations of Large Multi-modal Models (LMMs) is crucial for enhancing their reliability as general-purpose assistants. This paper shows that such hallucinations of LMMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LMMs. On our benchmark, the zero-shot performance of state-of-the-art LMMs drops significantly for both the VQA and captioning tasks. Next, we further reveal that this hallucination is mainly due to the prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning, which robustly fine-tunes LMMs on augmented multi-modal instruction-following datasets with hallucinatory dialogues. Extensive experiments show that our proposed approach successfully reduces dialogue hallucination while maintaining or even improving performance.
https://arxiv.org/abs/2403.10492
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require lots of trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models obtained via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers between branches from different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
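The search is optimized towards multiple objectives and returns architectures on a latency/mIoU Pareto frontier; below is a small sketch of Pareto-front extraction over already-evaluated candidates. The candidate format and metric names are illustrative assumptions.

```python
def pareto_front(candidates):
    """Keep candidates that are not dominated in (lower latency, higher mIoU).

    candidates: list of dicts like {"arch": ..., "latency_ms": float, "miou": float}.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["latency_ms"] <= c["latency_ms"] and o["miou"] >= c["miou"]
            and (o["latency_ms"] < c["latency_ms"] or o["miou"] > c["miou"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["latency_ms"])

# Example: the search evaluates many sampled sub-networks of the supernet and
# reports only the non-dominated ones, each trading latency against accuracy.
```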
https://arxiv.org/abs/2403.10413
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian Distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
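One way to instantiate the region-wise Gaussian alignment described above: model each SAM region's pixel features in two task branches as diagonal Gaussians and penalize their 2-Wasserstein distance. The diagonal covariance and the specific distance are assumptions for illustration, not necessarily the paper's exact loss.

```python
import torch

def region_gaussian(features: torch.Tensor, mask: torch.Tensor, eps: float = 1e-6):
    """Mean/std of per-pixel features (C, H, W) inside one SAM region mask (H, W)."""
    sel = features[:, mask.bool()]               # (C, N_pixels_in_region)
    return sel.mean(dim=1), sel.std(dim=1) + eps

def wasserstein2_diag(mu1, std1, mu2, std2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    return ((mu1 - mu2) ** 2).sum() + ((std1 - std2) ** 2).sum()

def cross_task_region_loss(feat_task_a, feat_task_b, sam_masks):
    """Align region-wise feature distributions between two task branches over shared SAM regions."""
    loss = feat_task_a.new_zeros(())
    for mask in sam_masks:
        mu_a, std_a = region_gaussian(feat_task_a, mask)
        mu_b, std_b = region_gaussian(feat_task_b, mask)
        loss = loss + wasserstein2_diag(mu_a, std_a, mu_b, std_b)
    return loss / max(len(sam_masks), 1)
```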
https://arxiv.org/abs/2403.10252
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images. In this paper, we propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye is better at temporal video grounding and comparable on other video-text tasks with existing video-text LLMs, which verifies its superior video-text multi-modal understanding abilities.
https://arxiv.org/abs/2403.10228
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
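A sketch of the cross-modal similarity consistency idea: intra-modal similarity rows are softened into distributions and used as soft targets for the cross-modal similarity rows via a KL term. The temperature and the exact pairing of targets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.05):
    """audio_emb, text_emb: (B, D) L2-normalized embeddings of paired audio clips and captions."""
    sim_a2t = audio_emb @ text_emb.T     # cross-modal similarities (B, B)
    sim_a2a = audio_emb @ audio_emb.T    # intra-modal (audio) similarities
    sim_t2t = text_emb @ text_emb.T      # intra-modal (text) similarities

    # Intra-modal similarity distributions act as soft supervision for cross-modal alignment.
    target_a = F.softmax(sim_a2a / tau, dim=-1).detach()
    target_t = F.softmax(sim_t2t / tau, dim=-1).detach()
    log_pred_a2t = F.log_softmax(sim_a2t / tau, dim=-1)
    log_pred_t2a = F.log_softmax(sim_a2t.T / tau, dim=-1)

    loss_a = F.kl_div(log_pred_a2t, target_t, reduction="batchmean")
    loss_t = F.kl_div(log_pred_t2a, target_a, reduction="batchmean")
    return 0.5 * (loss_a + loss_t)
```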
https://arxiv.org/abs/2403.10146
Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: this https URL.
https://arxiv.org/abs/2403.09626
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models of up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
https://arxiv.org/abs/2403.09611
Multimodal large language models (MLLMs) have shown impressive reasoning abilities; however, they are also more vulnerable to jailbreak attacks than their LLM predecessors. Although these models are still capable of detecting unsafe responses, we observe that the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed due to the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into text to activate the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe for LLaVA-1.5-7B), while consistently maintaining utility on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised fine-tuning (SFT) data for MLLM alignment without extra human intervention.
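A sketch of a training-free pipeline in the spirit of ECSO: let the MLLM judge its own response; if judged unsafe, convert the image into a query-aware caption and answer text-only so the aligned LLM's safety behavior applies. The `mllm.generate` interface and prompts are hypothetical, not the released code.

```python
def ecso_style_respond(mllm, image, query):
    """Eyes-Closed-Safety-On style inference wrapper (illustrative sketch)."""
    # 1. Normal multimodal answer.
    answer = mllm.generate(image=image, prompt=query)

    # 2. Ask the model itself whether its answer is harmful (inherent safety awareness).
    verdict = mllm.generate(image=None,
                            prompt=f"Is the following response harmful? Answer yes or no.\n{answer}")
    if "yes" not in verdict.lower():
        return answer

    # 3. Unsafe: "close the eyes" -- replace the image with a query-aware caption and
    #    answer in pure text, so the pre-aligned LLM's safety mechanism can activate.
    caption = mllm.generate(image=image,
                            prompt=f"Describe the image content relevant to: {query}")
    return mllm.generate(image=None,
                         prompt=f"Image description: {caption}\nQuestion: {query}")
```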
https://arxiv.org/abs/2403.09572
Environment maps endowed with sophisticated semantics are pivotal for facilitating seamless interaction between robots and humans, enabling them to effectively carry out various tasks. Open-vocabulary maps, powered by Visual-Language models (VLMs), possess inherent advantages, including multimodal retrieval and open-set classes. However, existing open-vocabulary maps are constrained to closed indoor scenarios and VLM features, thereby diminishing their usability and inference capabilities. Moreover, the absence of topological relationships further complicates the accurate querying of specific instances. In this work, we propose OpenGraph, a representation of open-vocabulary hierarchical graph structure designed for large-scale outdoor environments. OpenGraph initially extracts instances and their captions from visual images using 2D foundation models, encoding the captions with features to enhance textual reasoning. Subsequently, 3D incremental panoramic mapping with feature embedding is achieved by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results from real public dataset SemanticKITTI demonstrate that, even without fine-tuning the models, OpenGraph exhibits the ability to generalize to novel semantic classes and achieve the highest segmentation and query accuracy. The source code of OpenGraph is publicly available at this https URL.
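A sketch of the image-to-LiDAR feature lifting step described above: LiDAR points are projected into the camera image with standard pinhole geometry and pick up per-pixel caption/VLM features. The array shapes and the absence of occlusion handling are simplifying assumptions.

```python
import numpy as np

def attach_features_to_points(points_lidar, feats_img, K, T_cam_from_lidar):
    """Lift per-pixel VLM/caption features onto LiDAR points via pinhole projection.

    points_lidar: (N, 3) in the LiDAR frame; feats_img: (H, W, D) image-aligned features;
    K: (3, 3) camera intrinsics; T_cam_from_lidar: (4, 4) extrinsics.
    Returns an (N, D) array; points that fall outside the image keep zero features.
    """
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])        # homogeneous coordinates
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1                            # drop points behind the camera
    proj = (K @ pts_cam.T).T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w, d = feats_img.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros((n, d), dtype=feats_img.dtype)
    out[valid] = feats_img[v[valid], u[valid]]
    return out
```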
https://arxiv.org/abs/2403.09412
This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks with only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that empowers successful auto-regressive decoding to adeptly unify various visual tasks, from image-level understanding (e.g., captioning), over sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any task-specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, our GiT builds a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar effect observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results over various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at \url{this https URL}.
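A sketch of the universal language-interface idea for a detection sample: continuous box coordinates are quantized into vocabulary tokens so the model can emit them auto-regressively like text. The bin count and token naming are illustrative assumptions, not GiT's exact vocabulary.

```python
def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Quantize an (x1, y1, x2, y2) box into discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return [f"<coord_{min(int(v * num_bins), num_bins - 1)}>" for v in norm]

def detection_target_sequence(objects, image_w, image_h):
    """Serialize a detection annotation into one text sequence the model decodes token by token."""
    tokens = []
    for obj in objects:
        tokens += box_to_tokens(obj["box"], image_w, image_h) + [obj["category"]]
    return " ".join(tokens)

# Example: one image with a single dog box becomes a plain text target sequence.
print(detection_target_sequence(
    [{"box": (100, 50, 400, 250), "category": "dog"}], image_w=500, image_h=500))
# -> "<coord_200> <coord_100> <coord_800> <coord_500> dog"
```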
https://arxiv.org/abs/2403.09394
Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA or Adapter, project a model's hidden states to a lower dimension, allowing pre-trained models to adapt to new data through this low-rank bottleneck. However, PEFT tasks involving multiple modalities, like vision-language (VL) tasks, require not only adaptation to new data but also learning the relationship between the different modalities. Targeting VL PEFT tasks, we propose a family of operations, called routing functions, to enhance VL alignment in the low-rank bottlenecks. The routing functions adopt linear operations and do not introduce new trainable parameters. In-depth analyses are conducted to study their behavior. In various VL PEFT settings, the routing functions significantly improve the performance of the original PEFT methods, achieving over 20% improvement on VQAv2 ($\text{RoBERTa}_{\text{large}}$+ViT-L/16) and 30% on COCO Captioning (GPT2-medium+ViT-L/16). Also, when fine-tuning a pre-trained multimodal model such as CLIP-BART, we observe smaller but consistent improvements across a range of VL PEFT tasks.
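A sketch of a routing function inside a LoRA-style bottleneck: a parameter-free linear operation (here an element-wise product with features from the other modality) applied to the low-rank activations to promote VL alignment. The specific operation and where it is applied are assumptions for illustration, not the paper's definitive choice.

```python
from typing import Optional

import torch
import torch.nn as nn

class RoutedLoRALinear(nn.Module):
    """A frozen linear layer plus a low-rank adapter whose bottleneck is routed by another modality."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen pre-trained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # adapter starts as a no-op

    def forward(self, x: torch.Tensor, routing: Optional[torch.Tensor] = None) -> torch.Tensor:
        h = self.down(x)                                # (B, L, rank) low-rank bottleneck
        if routing is not None:
            # Routing function: a parameter-free element-wise product with features from the
            # other modality, assumed already projected to the same rank, shape (B, 1, rank).
            h = h * routing
        return self.base(x) + self.up(h)
```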
https://arxiv.org/abs/2403.09377
Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: this https URL.
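A sketch of the local scanning strategy: tokens are ordered window by window (row-major inside each window) instead of flattening the whole image row by row, which keeps spatially adjacent tokens close in the 1D sequence. The window size here is an illustrative choice.

```python
import torch

def local_scan_order(h: int, w: int, window: int = 2) -> torch.Tensor:
    """Permutation that visits tokens window-by-window instead of a global row-major scan."""
    idx = torch.arange(h * w).reshape(h, w)
    order = []
    for wy in range(0, h, window):
        for wx in range(0, w, window):
            block = idx[wy:wy + window, wx:wx + window]
            order.append(block.reshape(-1))
    return torch.cat(order)  # (h*w,) indices into the flattened token sequence

# Reorder a (B, H*W, C) token sequence before feeding it to the SSM scan,
# then invert the permutation afterwards to restore the spatial layout.
tokens = torch.randn(1, 4 * 4, 16)
perm = local_scan_order(4, 4, window=2)
scanned = tokens[:, perm]
restored = scanned[:, torch.argsort(perm)]
assert torch.equal(restored, tokens)
```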
https://arxiv.org/abs/2403.09338
Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
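The shape-bias percentages quoted above follow the standard cue-conflict protocol; a small sketch of the metric, assuming predictions on images whose shape and texture come from different classes.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions made according to shape rather than texture.

    Only images classified as either the shape class or the texture class count:
    shape_bias = #shape decisions / (#shape decisions + #texture decisions).
    """
    shape_hits = texture_hits = 0
    for pred, shape_cls, texture_cls in zip(predictions, shape_labels, texture_labels):
        if pred == shape_cls:
            shape_hits += 1
        elif pred == texture_cls:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# E.g., answering "cat" for a cat-shaped, elephant-textured image counts as a shape
# decision; answering "elephant" counts as a texture decision.
```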
https://arxiv.org/abs/2403.09193
With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adaptability to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available. Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model
https://arxiv.org/abs/2403.09027
In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among the patches demand more attention; and 2) authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate each patch feature with a MIL method that considers the correlations among instances. Furthermore, PathM3 overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data in the manner of multi-task joint learning. Extensive experiments with improved classification accuracy and caption generation demonstrate the effectiveness of our method on both the WSI classification and captioning tasks.
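A sketch of correlation-aware MIL aggregation for WSI patches: self-attention over patch embeddings lets instances exchange information before attention pooling into a single slide-level vector. Dimensions and depth are illustrative, not PathM3's exact design.

```python
import torch
import torch.nn as nn

class CorrelatedMILPooling(nn.Module):
    """Aggregate a bag of patch features (one slide = one bag) into a single WSI embedding."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Self-attention models correlations among instances (patches) in the bag.
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Gated attention pooling produces per-patch weights for the bag embedding.
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N_patches, dim) pre-extracted patch features
        mixed = self.mixer(patches)
        weights = torch.softmax(self.score(mixed), dim=1)   # (B, N, 1) attention over patches
        return (weights * mixed).sum(dim=1)                 # (B, dim) slide-level embedding

slides = torch.randn(2, 1000, 512)                 # two bags of 1000 patch features each
print(CorrelatedMILPooling()(slides).shape)        # torch.Size([2, 512])
```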
https://arxiv.org/abs/2403.08967
Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds of 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussians to represent the image, where each Gaussian has 8 parameters including position, covariance, and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method, with a minimum of 3$\times$ lower GPU memory usage and 5$\times$ faster fitting time, not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate an existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 1000 FPS. Additionally, a preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding.
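A minimal sketch of rendering by accumulated summation: every 2D Gaussian (2 position + 2 scale + 1 rotation + 3 color values, i.e. 8 parameters) is evaluated at each pixel and its color is summed with the Gaussian weight, with no depth sorting or alpha compositing. The parameterization here is simplified relative to the paper.

```python
import torch

def render_gaussian_image(means, scales, rotations, colors, h, w):
    """Accumulated-summation rendering of N 2D Gaussians into an (h, w, 3) image.

    means: (N, 2) pixel coords; scales: (N, 2) positive; rotations: (N,) radians; colors: (N, 3).
    Each Gaussian contributes exp(-0.5 * d^T Sigma^{-1} d) * color, summed over all Gaussians.
    """
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)            # (HW, 2)

    cos, sin = rotations.cos(), rotations.sin()
    R = torch.stack([torch.stack([cos, -sin], -1),
                     torch.stack([sin, cos], -1)], -2)            # (N, 2, 2) rotation matrices
    cov = R @ torch.diag_embed(scales ** 2) @ R.transpose(-1, -2)
    inv_cov = torch.linalg.inv(cov)                               # (N, 2, 2)

    d = pix[None, :, :] - means[:, None, :]                       # (N, HW, 2)
    mahal = torch.einsum("npi,nij,npj->np", d, inv_cov, d)        # squared Mahalanobis distance
    weights = torch.exp(-0.5 * mahal)                             # (N, HW)
    img = weights.T @ colors                                      # (HW, 3) plain summation
    return img.reshape(h, w, 3)
```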
https://arxiv.org/abs/2403.08551
Story Visualization (SV) is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately to improve the rendering of characters. In contrast, we embrace a completely parallel transformer-based approach, relying exclusively on Cross-Attention with past and future captions to achieve consistency. Additionally, we propose a Character Guidance technique to focus on the generation of characters in an implicit manner, by forming a combination of text-conditional and character-conditional logits in the logit space. We also employ a caption-augmentation technique, carried out by a Large Language Model (LLM), to enhance the robustness of our approach. The combination of these methods culminates in state-of-the-art (SOTA) results across various metrics on the most prominent SV benchmark (Pororo-SV), attained with constrained resources and with lower computational complexity than previous work. The validity of our quantitative results is supported by a human survey.
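A sketch of the Character Guidance idea: a classifier-free-guidance-style combination of text-conditional and character-conditional logits at each generation step. The combination rule and guidance weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def character_guided_logits(logits_uncond: torch.Tensor,
                            logits_text: torch.Tensor,
                            logits_char: torch.Tensor,
                            w_text: float = 5.0,
                            w_char: float = 2.0) -> torch.Tensor:
    """Combine unconditional, text-conditional, and character-conditional logits.

    Shapes: (B, vocab_size) per generation step. The character term steers decoding
    toward consistent character renderings without a separate character branch.
    """
    return (logits_uncond
            + w_text * (logits_text - logits_uncond)
            + w_char * (logits_char - logits_uncond))
```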
https://arxiv.org/abs/2403.08502