The recent work Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR) based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using an MLP to decode Low-Resolution (LR) features. However, these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem, we propose a novel Latent Modulated Function (LMF), which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space, thereby realizing the first computationally optimal paradigm of continuous image representation. Specifically, LMF utilizes an HD MLP in latent space to generate latent modulations for each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution. Furthermore, we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, offering the flexibility to adjust the decoding efficiency based on the rendering precision. Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, accelerate inference by up to 57 times, and save up to 76% of parameters, while maintaining competitive performance. The code is available at this https URL.
https://arxiv.org/abs/2404.16451
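To make the decoding split concrete, the sketch below (PyTorch) shows one way the described pipeline could look: a heavy latent MLP runs once per LR feature vector to produce modulation vectors, and a tiny modulated render MLP is then queried at every HR coordinate. Layer sizes, the shift-style modulation, and the coordinate parameterization are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch of the LMF idea: a heavy MLP decodes each LR feature vector once
# into modulations in latent space; a tiny modulated MLP is queried per HR pixel.
import torch
import torch.nn as nn

class LatentModulatedFunction(nn.Module):
    def __init__(self, feat_dim=64, latent_hidden=256, render_hidden=16, mod_layers=3):
        super().__init__()
        self.mod_layers = mod_layers
        self.render_hidden = render_hidden
        # HD MLP in LR latent space: one forward pass per LR feature vector.
        self.latent_mlp = nn.Sequential(
            nn.Linear(feat_dim, latent_hidden), nn.ReLU(),
            nn.Linear(latent_hidden, mod_layers * render_hidden),
        )
        # LD render MLP: queried once per HR coordinate, conditioned on the modulations.
        self.coord_in = nn.Linear(2, render_hidden)
        self.render = nn.ModuleList(
            [nn.Linear(render_hidden, render_hidden) for _ in range(mod_layers)]
        )
        self.to_rgb = nn.Linear(render_hidden, 3)

    def forward(self, lr_feats, coords):
        # lr_feats: (N, feat_dim) LR feature vectors; coords: (N, P, 2) HR query coordinates.
        mods = self.latent_mlp(lr_feats)                      # computed once, in LR space
        mods = mods.view(-1, self.mod_layers, self.render_hidden)
        h = torch.relu(self.coord_in(coords))                 # (N, P, H)
        for i, layer in enumerate(self.render):
            # Shift modulation: each cheap render layer is steered by the latent modulation.
            h = torch.relu(layer(h) + mods[:, i].unsqueeze(1))
        return self.to_rgb(h)                                 # (N, P, 3) RGB at arbitrary resolution

feats = torch.randn(8, 64)          # 8 LR feature vectors
coords = torch.rand(8, 1024, 2)     # 1024 HR query points per feature
rgb = LatentModulatedFunction()(feats, coords)
print(rgb.shape)  # torch.Size([8, 1024, 3])
```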
Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at this https URL.
https://arxiv.org/abs/2404.16365
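As a rough illustration of the text-side 3-way (in)equivalence check, the toy evaluation below ranks a semantic paraphrase against a lexically similar distractor using a bag-of-words encoder; the sentences and the encoder are placeholders (not VISLA data or the official protocol), chosen to show how lexical overlap can dominate the ranking.

```python
# Minimal sketch (not the official VISLA protocol): for each triplet
# (anchor, paraphrase with different wording, lexically similar distractor),
# a good encoder should place the paraphrase closer to the anchor.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

triplets = [
    # (anchor, paraphrase, lexically-close distractor) -- toy examples, not VISLA data
    ("a dog is to the left of the cat",
     "the cat is to the right of the dog",
     "a dog is to the right of the cat"),
    ("the man holds the umbrella above the child",
     "the child is sheltered under the man's umbrella",
     "the child holds the umbrella above the man"),
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

texts = [t for trip in triplets for t in trip]
vecs = TfidfVectorizer().fit_transform(texts).toarray()

correct = 0
for i, _ in enumerate(triplets):
    anchor, para, distractor = vecs[3 * i], vecs[3 * i + 1], vecs[3 * i + 2]
    correct += cosine(anchor, para) > cosine(anchor, distractor)
print(f"paraphrase ranked above lexical distractor: {correct}/{len(triplets)}")
```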
In this paper, we present OmniSearchSage, a versatile and scalable system for understanding search queries, pins, and products for Pinterest search. We jointly learn a unified query embedding coupled with pin and product embeddings, leading to an improvement of >8% relevance, >7% engagement, and >5% ads CTR in Pinterest's production search system. The main contributors to these gains are improved content understanding, better multi-task learning, and real-time serving. We enrich our entity representations using diverse text derived from image captions from a generative LLM, historical engagement, and user-curated boards. Our multitask learning setup produces a single search query embedding in the same space as pin and product embeddings and compatible with pre-existing pin and product embeddings. We show the value of each feature through ablation studies, and show the effectiveness of a unified model compared to standalone counterparts. Finally, we share how these embeddings have been deployed across the Pinterest search stack, from retrieval to ranking, scaling to serve 300k requests per second at low latency. Our implementation of this work is available at this https URL.
https://arxiv.org/abs/2404.16260
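The snippet below sketches one plausible form of the multi-task setup: a single query tower trained with an in-batch contrastive loss against both a pin tower and a product tower so that all embeddings live in one shared space. Tower shapes, the loss form, and the temperature are assumptions for illustration, not Pinterest's production configuration.

```python
# Hedged sketch of joint multi-task training of a unified query embedding.
import torch
import torch.nn.functional as F

def in_batch_contrastive(q, e, temperature=0.07):
    """Softmax contrastive loss where the i-th query matches the i-th entity."""
    logits = F.normalize(q, dim=-1) @ F.normalize(e, dim=-1).T / temperature
    targets = torch.arange(q.size(0))
    return F.cross_entropy(logits, targets)

dim = 256
query_tower = torch.nn.Linear(128, dim)    # stand-in for the query encoder
pin_tower = torch.nn.Linear(512, dim)      # stand-in for the pin encoder
product_tower = torch.nn.Linear(512, dim)  # stand-in for the product encoder

q_feat, pin_feat, prod_feat = torch.randn(32, 128), torch.randn(32, 512), torch.randn(32, 512)
q = query_tower(q_feat)
# One query embedding, two tasks: summing the losses keeps the same query space
# compatible with both pin and product embeddings.
loss = in_batch_contrastive(q, pin_tower(pin_feat)) + in_batch_contrastive(q, product_tower(prod_feat))
loss.backward()
print(float(loss))
```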
Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
https://arxiv.org/abs/2404.16123
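For context, a SemDeDup-style semantic deduplication pass can be sketched as below: embed, cluster, and within each cluster prune items that are nearly identical to an item already kept. FairDeDup's contribution concerns how the kept representative is chosen; the keep-order here is a plain placeholder, not the paper's fairness-aware criterion.

```python
# Rough sketch of SemDeDup-style semantic deduplication (not the exact FairDeDup algorithm).
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings, n_clusters=10, threshold=0.95, seed=0):
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(emb)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept = []
        for i in idx:  # the keep-order is where a fairness-aware selection rule would plug in
            sims = emb[kept] @ emb[i] if kept else np.array([])
            if sims.size == 0 or sims.max() < threshold:
                kept.append(i)
        keep.extend(kept)
    return sorted(keep)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 64))
near_dups = base[:50] + 0.01 * rng.normal(size=(50, 64))   # inject near-duplicates
kept = semantic_dedup(np.vstack([base, near_dups]))
print(f"kept {len(kept)} of 250 samples")
```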
Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into "image elements" that can faithfully reconstruct an input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition. Project page: this https URL
https://arxiv.org/abs/2404.16029
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of reference image, text, and target image. These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data and learn our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves the existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
https://arxiv.org/abs/2404.15516
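A toy sketch of the pseudo-triplet idea follows: related reference/target pairs are mined from auxiliary data by feature similarity, and a generator verbalizes the visual delta between them. The attribute-diff function stands in for the paper's LLM-based Visual Delta Generator; the features, thresholds, and attributes are invented for illustration.

```python
# Illustrative pseudo-triplet mining: pair related images, then verbalize the delta.
import numpy as np

def mine_pairs(feats, min_sim=0.6, max_sim=0.95):
    """Return (ref, tgt) index pairs that are related but not near-identical."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f.T
    pairs = []
    for i in range(len(f)):
        for j in range(len(f)):
            if i != j and min_sim < sims[i, j] < max_sim:
                pairs.append((i, j))
    return pairs

def describe_visual_delta(ref_attrs, tgt_attrs):
    """Toy stand-in for the VDG: verbalize attribute changes between two images."""
    changes = [f"change {k} from {ref_attrs[k]} to {v}"
               for k, v in tgt_attrs.items() if ref_attrs.get(k) != v]
    return ", ".join(changes) or "no visible change"

attrs = [{"object": "dress", "color": "red"},
         {"object": "dress", "color": "blue"},
         {"object": "car", "color": "blue"}]
feats = np.array([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]])

pseudo_triplets = [(i, describe_visual_delta(attrs[i], attrs[j]), j)
                   for i, j in mine_pairs(feats)]
print(pseudo_triplets)  # e.g. (0, 'change color from red to blue', 1)
```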
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks and 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
https://arxiv.org/abs/2404.15500
An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a "visual prompt" which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
https://arxiv.org/abs/2404.13594
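For readers unfamiliar with diagnostic (probing) classifiers, the sketch below shows the protocol on synthetic features: a simple classifier is fit on frozen representations to predict a spatial-relation label, and near-chance accuracy indicates the property is not linearly decodable. The features are synthetic stand-ins for frozen resampler outputs, not the paper's data.

```python
# Minimal diagnostic-classifier (probing) sketch on synthetic features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 2000, 64
labels = rng.integers(0, 2, size=n)                       # 0 = "left of", 1 = "right of"
signal = np.outer(labels - 0.5, rng.normal(size=dim))     # direction carrying spatial info
features_informative = rng.normal(size=(n, dim)) + 2.0 * signal
features_frozen_like = rng.normal(size=(n, dim))          # no spatial info, like a frozen prompt

for name, feats in [("informative", features_informative), ("uninformative", features_frozen_like)]:
    x_tr, x_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    # Probe accuracy near chance => the property is not linearly decodable from the features.
    print(f"{name} features: probe accuracy = {probe.score(x_te, y_te):.2f}")
```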
As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects the MLLM's understanding of diverse image content. Although some large-scale pretrained vision encoders such as the vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of the vision experts. This benefits from the powerful model function understanding ability of the large language model (LLM) equipped with expert-routing low-rank adaptation (LoRA). In the fine-grained stage, we elaborately design the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts. This coarse-to-fine paradigm effectively leverages representations from experts based on multimodal context and model expertise, further enhancing the generalization ability. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks. Codes and models will be available at this https URL.
https://arxiv.org/abs/2404.13046
AI in dermatology is evolving at a rapid pace, but the major limitation to training trustworthy classifiers is the scarcity of data with ground-truth concept-level labels, which are meta-labels semantically meaningful to humans. Foundation models like CLIP, which provide zero-shot capabilities, can help alleviate this challenge by leveraging vast amounts of image-caption pairs available on the internet. CLIP can be fine-tuned using domain-specific image-caption pairs to improve classification performance. However, CLIP's pre-training data is not well-aligned with the medical jargon that clinicians use to perform diagnoses. The development of large language models (LLMs) in recent years has led to the possibility of leveraging the expressive nature of these models to generate rich text. Our goal is to use these models to generate caption text that aligns well with both the clinical lexicon and with the natural human language used in CLIP's pre-training data. Starting with captions used for images in PubMed articles, we extend them by passing the raw captions through an LLM fine-tuned on several of the field's textbooks. We find that using captions generated by an expressive fine-tuned LLM like GPT-3.5 improves downstream zero-shot concept classification performance.
https://arxiv.org/abs/2404.13043
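The downstream zero-shot concept classification step can be sketched as below: an image embedding is compared against text embeddings of concept prompts and assigned the best match. The encoders here are random stand-ins for the fine-tuned CLIP towers, and the concept list and prompt template are illustrative assumptions.

```python
# Sketch of zero-shot concept classification with stand-in encoders.
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image):                 # stand-in for the (fine-tuned) CLIP image tower
    return rng.normal(size=512)

def encode_text(text):                   # stand-in for the (fine-tuned) CLIP text tower
    rng_t = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng_t.normal(size=512)

concepts = ["erythema", "papule", "plaque", "vesicle"]
# Prompts would follow the clinically-aligned caption style the paper targets.
prompts = [f"a dermatoscopic image showing {c}" for c in concepts]
text_emb = np.stack([encode_text(p) for p in prompts])
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

img_emb = encode_image("example_image")
img_emb /= np.linalg.norm(img_emb)

scores = text_emb @ img_emb              # cosine similarity to each concept prompt
print(concepts[int(scores.argmax())], scores.round(3))
```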
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. Such capabilities are built upon a localized visual tokenization mechanism, where an image input is decomposed into regions of interest and subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and ground its textual output to images. Besides, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging the powerful GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or an external module for localization, Groma consistently demonstrates superior performances in standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: this https URL.
https://arxiv.org/abs/2404.13013
Recent advancements in LLMs have shown their significant potential in tasks like text summarization and generation. Yet, they often encounter difficulty while solving complex physics problems that require arithmetic calculation and a good understanding of concepts. Moreover, many physics problems include images that contain important details required to understand the problem's context. We propose an LMM-based chatbot to answer multimodal physics MCQs. For domain adaptation, we utilize the MM-PhyQA dataset comprising Indian high school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques, RLHF (Reinforcement Learning from Human Feedback) and image captioning. In image captioning, we add a detailed explanation of the diagram in each image, minimizing hallucinations and image processing errors. We further explore integrating an RLHF methodology inspired by the ranking approach in RLHF to enhance the human-like problem-solving abilities of the models. The RLHF approach incorporates human feedback into the learning process of LLMs, improving the model's problem-solving skills, truthfulness, and reasoning capabilities, minimizing hallucinations in the answers, and improving quality over vanilla supervised fine-tuned models. We employ the LLaVA open-source model to answer multimodal physics MCQs and compare the performance with and without using RLHF.
https://arxiv.org/abs/2404.12926
This report introduces a solution to Topic 1 (Zero-shot Image Captioning) of the 2024 NICE challenge: New frontiers for zero-shot Image Captioning Evaluation. In contrast to the NICE 2023 datasets, this challenge involves new annotations by humans with significant differences in caption style and content. Therefore, we enhance image captions effectively through retrieval augmentation and caption grading methods. At the data level, we utilize high-quality captions generated by image caption models as training data to address the gap in text styles. At the model level, we employ OFA (a large-scale visual-language pre-training model based on handcrafted templates) to perform the image captioning task. Subsequently, we propose a caption-level strategy for the high-quality caption data generated by the image caption models and integrate it with the retrieval augmentation strategy into the template to compel the model to generate higher quality, more matching, and semantically enriched captions based on the retrieval augmentation prompts. Our approach ranks first on the leaderboard, achieving a CIDEr score of 234.11 and ranking 1st in all other metrics.
https://arxiv.org/abs/2404.12739
Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is true especially when attempting to predict 3D information based on 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem, optimizing via a differentiable rendering pipeline over the latent space of pre-trained 3D object representations and retrieving the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. We investigate not only an alternate take on tracking, but our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both these datasets are completely unseen to our method and do not require fine-tuning. Videos and code are available at this https URL.
https://arxiv.org/abs/2404.12359
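The core inverse-rendering loop can be illustrated with a toy differentiable "renderer": starting from a latent code, render, compute an image-space loss against the observation, and backpropagate into the latent. The two-layer decoder below stands in for the pre-trained 3D object representation; it is a minimal sketch of the optimization pattern, not the paper's pipeline.

```python
# Toy inverse-rendering loop: optimize a latent code through a frozen differentiable renderer.
import torch

torch.manual_seed(0)
renderer = torch.nn.Sequential(            # frozen, pre-trained generative prior (stand-in)
    torch.nn.Linear(16, 256), torch.nn.ReLU(), torch.nn.Linear(256, 32 * 32)
)
for p in renderer.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    target_latent = torch.randn(16)
    observation = renderer(target_latent)  # the "input image" we want to explain

latent = torch.zeros(16, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(renderer(latent), observation)
    loss.backward()                        # gradients flow through the renderer into the latent
    opt.step()
print(f"final image loss: {loss.item():.4f}")  # the recovered latent explains the observation
```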
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite the promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations, which fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce a novel SSL framework, OPTiML, employing optimal transport (OT) to capture dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes variance and covariance regularizations within the OT framework to force the model to focus on clinically relevant information while discarding less informative features. Through these components, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from the chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
https://arxiv.org/abs/2404.11868
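The variance and covariance regularizations mentioned above can be written in the standard VICReg-style form shown below; OPTiML applies such terms within its OT framework, and the OT alignment term itself is omitted here for brevity, so this is a sketch of the regularizers rather than the full objective.

```python
# Standard-form variance and covariance regularizers on a batch of embeddings.
import torch

def variance_term(z, eps=1e-4, gamma=1.0):
    std = torch.sqrt(z.var(dim=0) + eps)
    return torch.relu(gamma - std).mean()          # push each dimension's std above gamma

def covariance_term(z):
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (z.size(0) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / z.size(1)       # decorrelate feature dimensions

z = torch.randn(128, 256, requires_grad=True)      # a batch of embeddings from one view
reg = variance_term(z) + covariance_term(z)
reg.backward()
print(float(reg))
```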
In the face of burgeoning image data, efficiently retrieving similar images poses a formidable challenge. Past research has focused on refining hash functions to distill images into compact indicators of resemblance. Initial attempts used shallow models, evolving from Convolutional Neural Networks (CNNs) to attention mechanism-based architectures and more advanced models. Recognizing the limitations of gradient-based models for spatial information embedding, we propose an innovative image hashing method, NeuroHash, leveraging Hyperdimensional Computing (HDC). HDC symbolically encodes spatial information into high-dimensional vectors, reshaping image representation. Our approach combines pre-trained large vision models with HDC operations, enabling spatially encoded feature representations. Hashing with locality-sensitive hashing (LSH) ensures swift and efficient image retrieval. Notably, our framework allows dynamic hash manipulation for conditional image retrieval. Our work introduces a transformative image hashing framework enabling spatial-aware conditional retrieval. By seamlessly combining DNN-based neural and HDC-based symbolic models, our methodology breaks from traditional training, offering flexible and conditional image retrieval. Performance evaluations signify a paradigm shift in image-hashing methodologies, demonstrating enhanced retrieval accuracy.
https://arxiv.org/abs/2404.11025
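The locality-sensitive hashing step can be sketched with classic sign-of-random-projection hashing: similar vectors tend to share a hash key, so retrieval scans a single bucket instead of the whole database. The plain Gaussian features below stand in for the hyperdimensional encodings, and the bit count is an arbitrary choice.

```python
# Minimal sign-of-random-projection LSH sketch for fast similarity retrieval.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 1024, 12
planes = rng.normal(size=(n_bits, dim))            # random hyperplanes

def lsh_key(v):
    bits = (planes @ v) > 0                        # one sign bit per hyperplane
    return bits.astype(np.uint8).tobytes()

database = rng.normal(size=(5000, dim))
buckets = defaultdict(list)
for i, v in enumerate(database):
    buckets[lsh_key(v)].append(i)

query = database[42] + 0.01 * rng.normal(size=dim) # a slightly perturbed known item
candidates = buckets[lsh_key(query)]               # only this bucket is scanned
print(42 in candidates, f"bucket size = {len(candidates)}")
```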
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify that the prior underperformance of diffusion models stems from the absence of an effective latent space for image-text alignment, and from the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
https://arxiv.org/abs/2404.10763
Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations. With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By launching the ANCHOR dataset, we hope to motivate research in furthering the Natural Language Understanding (NLU) capabilities of T2I models.
https://arxiv.org/abs/2404.10141
The advent of Large Multimodal Models (LMMs) has sparked a surge in research aimed at harnessing their remarkable reasoning abilities. However, for understanding text-rich images, challenges persist in fully leveraging the potential of LMMs, and existing methods struggle with effectively processing high-resolution images. In this work, we propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding. TextCoT utilizes the captioning ability of LMMs to grasp the global context of the image and the grounding capability to examine local textual regions. This allows for the extraction of both global and local visual information, facilitating more accurate question-answering. Technically, TextCoT consists of three stages, including image overview, coarse localization, and fine-grained observation. The image overview stage provides a comprehensive understanding of the global scene information, and the coarse localization stage approximates the image area containing the answer based on the question asked. Then, integrating the obtained global image descriptions, the final stage further examines specific regions to provide accurate answers. Our method is free of extra training, offering immediate plug-and-play functionality. Extensive experiments are conducted on a series of text-rich image question-answering benchmark datasets based on several advanced LMMs, and the results demonstrate the effectiveness and strong generalization ability of our method. Code is available at this https URL.
https://arxiv.org/abs/2404.09797
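The three-stage flow can be sketched as a simple prompting pipeline: caption the whole image, ask for a coarse bounding box of the answer region, then query again on the cropped region together with the global description. The query_lmm function below is a hypothetical stand-in with canned replies so the control flow runs end to end; it is not the paper's implementation.

```python
# Schematic sketch of an overview -> coarse localization -> fine-grained observation pipeline.
import numpy as np

def query_lmm(image, prompt):
    # Hypothetical stand-in for whatever LMM is plugged in (the method is training-free).
    canned = {
        "overview": "A store receipt on a wooden table.",
        "locate": "120,40,360,220",                      # x1,y1,x2,y2 of the answer region
        "answer": "The total amount is $23.50.",
    }
    return canned["overview" if "Describe" in prompt else "locate" if "box" in prompt else "answer"]

def crop(image, box):
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def textcot(image, question):
    # Stage 1: image overview -- grasp the global context via captioning.
    overview = query_lmm(image, "Describe this image.")
    # Stage 2: coarse localization -- ask for the region likely to contain the answer.
    box = [int(v) for v in query_lmm(image, f"{question} Give a bounding box.").split(",")]
    # Stage 3: fine-grained observation -- zoom into the region, keeping the global description.
    region = crop(image, box)
    return query_lmm(region, f"Context: {overview}\nQuestion: {question}\nAnswer:")

print(textcot(np.zeros((480, 640, 3)), "What is the total amount on the receipt?"))
```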
The emergence of Large Multimodal Models (LMMs) marks a significant milestone in the development of artificial intelligence. Insurance, as a vast and complex discipline, involves a wide variety of data forms in its operational processes, including text, images, and videos, thereby giving rise to diverse multimodal tasks. Despite this, there has been limited systematic exploration of multimodal tasks specific to insurance, and no thorough investigation into how LMMs can address these challenges. In this paper, we explore GPT-4V's capabilities in the insurance domain. We categorize multimodal tasks by focusing primarily on visual aspects based on types of insurance (e.g., auto, household/commercial property, health, and agricultural insurance) and insurance stages (e.g., risk assessment, risk monitoring, and claims processing). Our experiment reveals that GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating not only a robust understanding of multimodal content in the insurance domain but also a comprehensive knowledge of insurance scenarios. However, there are notable shortcomings: GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages. Through this work, we aim to bridge the insurance domain with cutting-edge LMM technology, facilitate interdisciplinary exchange and development, and provide a foundation for the continued advancement and evolution of future research endeavors.
https://arxiv.org/abs/2404.09690