Over the past few years, deep neural models have made considerable advances in image quality assessment (IQA). However, the underlying reasons for their success remain unclear, owing to the complex nature of deep neural networks. IQA aims to describe how the human visual system (HVS) works and to create efficient approximations of it. The saliency prediction task, on the other hand, aims to emulate the HVS by determining areas of visual interest. We therefore believe that saliency plays a crucial role in human perception. In this work, we conduct an empirical study that reveals the relation between IQA and saliency prediction, demonstrating that the former incorporates knowledge of the latter. Moreover, we introduce SACID, a novel dataset of saliency-aware compressed images, and conduct a large-scale comparison of classic and neural-network-based IQA methods. All supplementary code and data will be made available at the time of publication.
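To make the probing idea concrete, here is a minimal sketch that fits a linear read-out from (stand-in) IQA-model features to downsampled saliency maps and reports the correlation metric common in saliency evaluation; all arrays, shapes, and names are synthetic placeholders rather than the authors' protocol.

```python
# A minimal probing sketch, assuming pooled IQA-model activations and
# ground-truth saliency maps are available as arrays; random data is used
# here purely as a placeholder.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)
n_images, feat_dim, map_size = 200, 512, 16 * 16

features = rng.normal(size=(n_images, feat_dim))   # stand-in for IQA-model features
saliency = rng.random(size=(n_images, map_size))   # stand-in for downsampled saliency maps

train, test = slice(0, 150), slice(150, None)

# Linear read-out: if saliency is decodable from the features, the IQA model
# has implicitly learned saliency-related structure.
W, *_ = lstsq(features[train], saliency[train], rcond=None)
pred = features[test] @ W

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Mean per-image correlation, i.e. the CC metric common in saliency evaluation.
cc = np.mean([pearson(p, t) for p, t in zip(pred, saliency[test])])
print(f"mean CC of the linear saliency probe: {cc:.3f}")
```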
https://arxiv.org/abs/2405.04997
Recent advancements in large language models (LLMs) have achieved promising performance across various applications. Nonetheless, the ongoing challenge of integrating long-tail knowledge continues to impede the seamless adoption of LLMs in specialized domains. In this work, we introduce DALK (Dynamic Co-Augmentation of LLMs and KG) to address this limitation and demonstrate its ability in studying Alzheimer's Disease (AD), a specialized sub-field of biomedicine and a global health priority. With a synergized framework in which the LLM and the KG mutually enhance each other, we first leverage the LLM to construct an evolving AD-specific knowledge graph (KG) sourced from AD-related scientific literature, and then we use a coarse-to-fine sampling method with a novel self-aware knowledge retrieval approach to select appropriate knowledge from the KG to augment the LLM's inference capabilities. Experimental results on our constructed AD question answering (ADQA) benchmark underscore the efficacy of DALK. Additionally, we perform a series of detailed analyses that offer valuable insights and guidelines for the emerging topic of mutually enhancing KGs and LLMs. We will release the code and data at this https URL.
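A minimal sketch of the coarse-to-fine retrieval idea is given below, assuming a toy triple-based KG and a placeholder `llm_relevance` scorer standing in for the self-aware knowledge retrieval step; DALK's actual KG construction and sampling are considerably more involved.

```python
# Coarse-to-fine KG retrieval sketch; the toy KG and the `llm_relevance`
# scorer are hypothetical placeholders for the paper's self-aware retrieval.
from typing import List, Tuple

Triple = Tuple[str, str, str]

KG: List[Triple] = [
    ("amyloid beta", "accumulates_in", "hippocampus"),
    ("APOE4", "increases_risk_of", "Alzheimer's disease"),
    ("donepezil", "treats", "Alzheimer's disease"),
    ("tau protein", "forms", "neurofibrillary tangles"),
]

def coarse_retrieve(question: str, kg: List[Triple]) -> List[Triple]:
    """Coarse stage: keep triples whose head or tail entity appears in the question."""
    q = question.lower()
    return [t for t in kg if t[0].lower() in q or t[2].lower() in q]

def llm_relevance(question: str, triple: Triple) -> float:
    """Fine stage stand-in: an LLM would score how useful a triple is (0..1)."""
    return 1.0 if "risk" in question.lower() and "risk" in triple[1] else 0.5

def build_prompt(question: str, top_k: int = 2) -> str:
    candidates = coarse_retrieve(question, KG)
    ranked = sorted(candidates, key=lambda t: llm_relevance(question, t), reverse=True)
    facts = "\n".join(" ".join(t) for t in ranked[:top_k])
    return f"Knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Does APOE4 increase the risk of Alzheimer's disease?"))
```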
https://arxiv.org/abs/2405.04819
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work that tackles simple queries directly referencing target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not only which target objects a query pertains to, but also whether a consensus on their states exists for the query to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between LLM-predicted answers and human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also appear visually different upon modification -- enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
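The uniqueness filter in such a Prompt-Generate-Evaluate loop could look roughly like the sketch below, where TF-IDF cosine similarity stands in for the multiple semantic-similarity measures the paper uses and the threshold is illustrative.

```python
# Uniqueness filter sketch: keep a generated query only if it is not too
# similar to already-accepted queries. TF-IDF cosine is a stand-in for the
# semantic-similarity measures used in the paper; the threshold is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

accepted = ["Is the bathroom clean and dry?", "Is the kitchen ready for cooking dinner?"]
candidates = ["Is the bathroom dry and clean?", "Is the living room tidy enough for guests?"]

THRESHOLD = 0.8  # above this similarity, a candidate counts as a duplicate

for query in candidates:
    corpus = accepted + [query]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(tfidf)[-1, :-1]   # candidate vs. all accepted queries
    if sims.max() < THRESHOLD:
        accepted.append(query)
        print(f"accepted: {query!r} (max similarity {sims.max():.2f})")
    else:
        print(f"rejected as near-duplicate: {query!r} (max similarity {sims.max():.2f})")
```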
https://arxiv.org/abs/2405.04732
Automatic perception of image quality is a challenging problem that impacts billions of Internet and social media users daily. To advance research in this field, we propose a no-reference image quality assessment (NR-IQA) method termed Cross-IQA based on the vision transformer (ViT) model. The proposed Cross-IQA method can learn image quality features from unlabeled image data. We construct a pretext task of synthesized-image reconstruction to extract image quality information in an unsupervised manner using ViT blocks. The pretrained encoder of Cross-IQA is then used to fine-tune a linear regression model for score prediction. Experimental results show that Cross-IQA can achieve state-of-the-art performance in assessing the low-frequency degradation information (e.g., color change, blurring, etc.) of images compared with classical full-reference IQA and NR-IQA methods on the same datasets.
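A toy two-stage version of this scheme is sketched below: self-supervised pretraining on a reconstruction pretext task, followed by a frozen encoder with a linear regression head for quality scores. The tiny convolutional encoder and synthetic data are stand-ins for the ViT blocks and real distortions used in the paper.

```python
# Toy two-stage sketch: (1) reconstruction pretext task on unlabeled images,
# (2) frozen encoder + linear head for quality-score regression.
# The small conv encoder stands in for ViT blocks; all data is synthetic.
import torch
import torch.nn as nn

torch.manual_seed(0)
images = torch.rand(64, 3, 32, 32)                                   # clean images
degraded = (images + 0.1 * torch.randn_like(images)).clamp(0, 1)     # synthesized distortion
mos = torch.rand(64, 1)                                              # stand-in quality scores

encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),                                                    # -> 32 * 8 * 8 = 2048
)
decoder = nn.Linear(2048, 3 * 32 * 32)

# Stage 1: unsupervised pretraining via reconstruction of the clean image.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(5):
    recon = decoder(encoder(degraded)).view_as(images)
    loss = nn.functional.mse_loss(recon, images)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the encoder and fit a linear regression head on scores.
for p in encoder.parameters():
    p.requires_grad_(False)
head = nn.Linear(2048, 1)
opt2 = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(5):
    loss = nn.functional.mse_loss(head(encoder(degraded)), mos)
    opt2.zero_grad(); loss.backward(); opt2.step()
print("pretext and regression stages ran on toy data; final loss:", float(loss))
```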
https://arxiv.org/abs/2405.04311
This study introduces 'clickbait spoiling', a novel technique designed to detect, categorize, and generate spoilers as succinct text responses, countering the curiosity induced by clickbait content. By leveraging a multi-task learning framework, our model's generalization capabilities are significantly enhanced, effectively addressing the pervasive issue of clickbait. The crux of our research lies in generating an appropriate spoiler, be it a phrase, an extended passage, or multiple passages, depending on the spoiler type required. Our methodology integrates two crucial techniques: a refined spoiler categorization method and a modified version of the Question Answering (QA) mechanism, incorporated within a multi-task learning paradigm for optimized spoiler extraction from context. Notably, we include fine-tuning methods for models capable of handling longer sequences to accommodate the generation of extended spoilers. This research highlights the potential of sophisticated text processing techniques in tackling the omnipresent issue of clickbait, promising an enhanced user experience in the digital realm.
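A rough multi-task sketch of this setup is shown below: one shared encoder with a spoiler-type classification head and a QA-style span-extraction head trained under a joint loss. The randomly initialized encoder and synthetic labels are placeholders for the pretrained long-sequence models the paper fine-tunes.

```python
# Multi-task sketch: shared encoder, a spoiler-type classification head, and a
# QA-style span-extraction head trained with a joint loss. The tiny random
# encoder and labels are placeholders for the pretrained models and data.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, seq_len, dim, n_types = 1000, 64, 128, 3     # types: phrase / passage / multi
tokens = torch.randint(0, vocab, (8, seq_len))
type_labels = torch.randint(0, n_types, (8,))
start_labels = torch.randint(0, seq_len, (8,))
end_labels = torch.randint(0, seq_len, (8,))

embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
)
type_head = nn.Linear(dim, n_types)    # spoiler categorization
span_head = nn.Linear(dim, 2)          # per-token start / end logits (QA mechanism)

params = [*embed.parameters(), *encoder.parameters(),
          *type_head.parameters(), *span_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-4)
ce = nn.CrossEntropyLoss()

hidden = encoder(embed(tokens))                            # (batch, seq, dim)
type_logits = type_head(hidden.mean(dim=1))                # pooled -> spoiler type
start_logits, end_logits = span_head(hidden).unbind(-1)    # (batch, seq) each

loss = ce(type_logits, type_labels) + ce(start_logits, start_labels) + ce(end_logits, end_labels)
opt.zero_grad(); loss.backward(); opt.step()
print("joint multi-task loss:", float(loss))
```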
https://arxiv.org/abs/2405.04292
The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming, especially for authentic images. Training on synthetic data is expected to be beneficial, but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation: introducing more distortion types into the synthetic dataset may not improve, and may even harm, generalization to authentic image quality assessment. To address this challenge, we propose distortion-guided unsupervised domain adaptation for BIQA (DGQA), a novel framework that leverages adaptive multi-domain selection via prior knowledge from distortion to match the data distribution between the source domains and the target domain, thereby reducing negative transfer from outlier source domains. Extensive experiments on two cross-domain settings (synthetic distortion to authentic distortion and synthetic distortion to algorithmic distortion) demonstrate the effectiveness of the proposed DGQA. Moreover, DGQA is orthogonal to existing model-based BIQA methods and can be combined with such models to improve performance with less training data.
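The domain-selection idea can be illustrated with the simplified sketch below, which keeps only the synthetic-distortion source domains whose feature distributions lie close to the target domain; the features are random stand-ins and the mean-embedding distance is an illustrative criterion, not the paper's exact selection rule.

```python
# Simplified source-domain selection: keep only synthetic-distortion domains
# whose feature distribution is close to the target (authentic) domain.
# Random features and a mean-embedding distance are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, size=(500, 64))                # authentic-image features

source_domains = {
    "gaussian_blur":  rng.normal(0.1, 1.0, size=(500, 64)),  # close to target
    "jpeg":           rng.normal(0.2, 1.1, size=(500, 64)),  # close-ish
    "color_quantize": rng.normal(3.0, 1.0, size=(500, 64)),  # outlier distortion
}

def mean_embedding_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

distances = {name: mean_embedding_distance(f, target) for name, f in source_domains.items()}
threshold = float(np.median(list(distances.values())))
selected = [name for name, d in distances.items() if d <= threshold]

print("domain distances:", {k: round(v, 2) for k, v in distances.items()})
print("selected source domains:", selected)   # outlier domains are excluded
```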
https://arxiv.org/abs/2405.04167
While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA - a novel 4D implementation of VSAs that builds a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyperdimensional vector space. To encode natural images, we extend the SPA to include dimensions for an object's width and height in addition to its spatial location. To perform spatial queries, we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark dataset and show that it can effectively encode natural images, achieving performance competitive with state-of-the-art deep learning methods for zero-shot VQA.
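For readers unfamiliar with SPA-style encoding, the sketch below shows 4D fractional power encoding in this spirit: unitary base vectors for x, y, width, and height are exponentiated in the Fourier domain and bound to an object symbol by circular convolution, and a query then unbinds the location to recover the object. Dimensions and values are toy choices, not the paper's configuration.

```python
# 4D fractional power encoding sketch in the spirit of the SPA: bind an object
# symbol to continuous x, y, width, height via Fourier-domain exponentiation
# and circular convolution, then unbind the location to recover the object.
# Dimensionality and values are toy choices.
import numpy as np

D = 1024
rng = np.random.default_rng(0)

def unitary_vector(d: int) -> np.ndarray:
    """Random unitary vector: unit-magnitude Fourier coefficients."""
    f = np.fft.fft(rng.normal(size=d))
    return np.fft.ifft(f / np.abs(f)).real

def power(v: np.ndarray, exponent: float) -> np.ndarray:
    """Fractional binding: v ** exponent via the Fourier domain."""
    return np.fft.ifft(np.fft.fft(v) ** exponent).real

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Binding = circular convolution."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def inverse(v: np.ndarray) -> np.ndarray:
    """Approximate inverse under circular convolution."""
    return np.fft.ifft(1.0 / np.fft.fft(v)).real

X, Y, W, H = (unitary_vector(D) for _ in range(4))
obj = rng.normal(size=D) / np.sqrt(D)                     # symbol for, e.g., "cup"

# Encode "cup at (x=2.0, y=3.5) with width 1.0, height 0.8".
location = bind(bind(power(X, 2.0), power(Y, 3.5)), bind(power(W, 1.0), power(H, 0.8)))
memory = bind(obj, location)

# Spatial query: unbind the location; similarity with `obj` should be close to 1.
probe = bind(memory, inverse(location))
cos = float(obj @ probe / (np.linalg.norm(obj) * np.linalg.norm(probe)))
print(f"similarity of the recovered symbol to 'cup': {cos:.3f}")
```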
https://arxiv.org/abs/2405.03852
Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling yields a strong 3D perception capability without any 3D-specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted, e.g., with a 2D box or a set of candidate 3D boxes from specialist models. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines, by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and by 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also shows competitive results on general MLLM benchmarks such as refCOCO for 2D grounding, with an average score of 87.0, as well as on visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at this https URL.
https://arxiv.org/abs/2405.03685
Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
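Schematically, the three-step harvesting pipeline can be pictured as below; the toy corpus, regex-based extraction, and `refine_with_llm` stub are placeholders for the web-scale recall and open-source-LLM extraction and refinement described in the paper.

```python
# Schematic recall -> extract -> refine pipeline; the toy corpus, regex
# extraction, and `refine_with_llm` stub are placeholders for the web-scale
# pipeline described in the paper.
import re
from typing import List, Tuple

corpus = [
    "Q: What is 12 * 8? A: 12 * 8 = 96.",
    "Today the weather in Paris was mild and sunny.",
    "Q: Solve x + 3 = 10. A: Subtract 3 from both sides, so x = 7.",
]

def recall(docs: List[str]) -> List[str]:
    """Step 1: keep documents that look instruction-like (cheap heuristic)."""
    return [d for d in docs if "Q:" in d and "A:" in d]

def extract(doc: str) -> Tuple[str, str]:
    """Step 2: split a recalled document into an instruction-response pair."""
    match = re.search(r"Q:\s*(.*?)\s*A:\s*(.*)", doc)
    return match.group(1), match.group(2)

def refine_with_llm(question: str, answer: str) -> Tuple[str, str]:
    """Step 3: placeholder for an open-source LLM that cleans up noisy pairs."""
    return question.strip(), answer.strip()

pairs = [refine_with_llm(*extract(doc)) for doc in recall(corpus)]
for q, a in pairs:
    print({"instruction": q, "response": a})
```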
https://arxiv.org/abs/2405.03548
This research focuses on evaluating the non-commercial open-source large language models (LLMs) Meditron, MedAlpaca, Mistral, and Llama-2 for their efficacy in interpreting medical guidelines saved in PDF format. As a specific test scenario, we applied these models to the guidelines for hypertension in children and adolescents provided by the European Society of Cardiology (ESC). Leveraging Streamlit, a Python library, we developed a user-friendly medical document chatbot tool (MedDoc-Bot). This tool enables authorized users to upload PDF files and pose questions, generating interpretive responses from four locally stored LLMs. A pediatric expert provides a benchmark for evaluation by formulating questions and responses extracted from the ESC guidelines, and rates the model-generated responses based on their fidelity and relevance. Additionally, we computed METEOR and chrF metric scores to assess the similarity of model responses to the reference answers. Our study found that Llama-2 and Mistral performed well in the metric evaluation; however, Llama-2 was slower when dealing with text and tabular data. In our human evaluation, we observed that responses created by Mistral, Meditron, and Llama-2 exhibited reasonable fidelity and relevance. This study provides valuable insights into the strengths and limitations of LLMs for future developments in medical document interpretation. Open-Source Code: this https URL
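The automatic part of the evaluation can be reproduced in miniature as below, computing METEOR (nltk) and chrF (sacrebleu) between a model answer and a reference; the example sentences are invented, and real use would loop over the expert-curated question/answer pairs.

```python
# Automatic-metric sketch: METEOR (nltk >= 3.7 expects tokenized inputs) and
# chrF (sacrebleu) between a model answer and a reference answer.
# The sentences are invented examples, not guideline content.
import nltk
from nltk.translate.meteor_score import meteor_score
from sacrebleu.metrics import CHRF

nltk.download("wordnet", quiet=True)   # required by METEOR

reference = "Ambulatory blood pressure monitoring is recommended to confirm hypertension."
hypothesis = "Ambulatory monitoring of blood pressure should be used to confirm the diagnosis."

meteor = meteor_score([reference.split()], hypothesis.split())
chrf = CHRF().sentence_score(hypothesis, [reference]).score

print(f"METEOR: {meteor:.3f}")
print(f"chrF:   {chrf:.1f}")
```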
https://arxiv.org/abs/2405.03359
Recently, User-Generated Content (UGC) videos have gained popularity in our daily lives. However, UGC videos often suffer from poor exposure due to the limitations of photographic equipment and techniques. Therefore, Video Exposure Correction (VEC) algorithms have been proposed, including Low-Light Video Enhancement (LLVE) and Over-Exposed Video Recovery (OEVR). Equally important to VEC is Video Quality Assessment (VQA). Unfortunately, almost all existing VQA models are built as general-purpose models, measuring the quality of a video from a comprehensive perspective; to address this, Light-VQA, trained on LLVE-QA, was proposed for assessing LLVE. We extend the work of Light-VQA by expanding the LLVE-QA dataset into the Video Exposure Correction Quality Assessment (VEC-QA) dataset, which adds over-exposed videos and their corresponding corrected versions. In addition, we propose Light-VQA+, a VQA model specialized in assessing VEC. Light-VQA+ differs from Light-VQA mainly in its use of the CLIP model and vision-language guidance during feature extraction, followed by a new module modeled on the Human Visual System (HVS) for more accurate assessment. Extensive experimental results show that our model achieves the best performance against current state-of-the-art (SOTA) VQA models on the VEC-QA dataset and other public datasets.
https://arxiv.org/abs/2405.03333
Multimodal information, together with our knowledge, helps us to understand the complex and dynamic world. Large language models (LLMs) and large multimodal models (LMMs), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models with three appealing properties: (1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation. (2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception. (3) Long-Chain Reasoning: Our dataset requires an average of 4.45 reasoning steps, notably surpassing other videoQA datasets. Furthermore, we introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain, thereby facilitating accurate responses to WorldQA queries. Extensive evaluations of 13 prominent LLMs and LMMs reveal that WorldRetriever, although the most effective model, achieved only 70% of human-level performance in multiple-choice questions. This finding highlights the necessity for further advancement in the reasoning and comprehension abilities of models. Our experiments also yield several key insights. For instance, while humans tend to perform better with increased frames, current LMMs, including WorldRetriever, show diminished performance under similar conditions. We hope that WorldQA, our methodology, and these insights can contribute to the future development of multimodal world models.
https://arxiv.org/abs/2405.03272
Many clinical tasks require an understanding of specialized data, such as medical images and genomics, which is not typically found in general-purpose large multimodal models. Building upon Gemini's multimodal models, we develop several models within the new Med-Gemini family that inherit core capabilities of Gemini and are optimized for medical use via fine-tuning with 2D and 3D radiology, histopathology, ophthalmology, dermatology and genomic data. Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report generation based on expert evaluation, exceeding previous best results across two separate datasets by an absolute margin of 1% and 12%, where 57% and 96% of AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as "equivalent or better" than the original radiologists' reports. We demonstrate the first ever large multimodal model-based report generation for 3D computed tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered clinically acceptable, although additional research is needed to meet expert radiologist reporting quality. Beyond report generation, Med-Gemini-2D surpasses the previous best performance in CXR visual question answering (VQA) and performs well in CXR classification and radiology VQA, exceeding SoTA or baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology image classification, Med-Gemini-2D surpasses baselines across 18 out of 20 tasks and approaches task-specific model performance. Beyond imaging, Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based approach for disease risk prediction and generalizes to genetically correlated diseases for which it has never been trained. Although further development and evaluation are necessary in the safety-critical medical domain, our results highlight the potential of Med-Gemini across a wide range of medical tasks.
https://arxiv.org/abs/2405.03162
Quantum computing has shown promise in solving complex problems by leveraging the principles of superposition and entanglement. Variational quantum algorithms (VQAs) are a class of algorithms suited for near-term quantum computers due to their modest requirements on qubit counts and circuit depth. This paper introduces Tetris, a compilation framework for VQA applications on near-term quantum devices. Tetris focuses on reducing two-qubit gates in the compilation process, since a two-qubit gate has roughly an order of magnitude higher error rate and execution time than a single-qubit gate. Tetris exploits unique opportunities in the circuit synthesis stage, often overlooked by state-of-the-art VQA compilers, for reducing the number of two-qubit gates, and comes with a refined Pauli-string IR to express such two-qubit gate optimization opportunities. Moreover, Tetris is equipped with a fast bridging approach that mitigates the hardware mapping cost. Overall, Tetris demonstrates a reduction of up to 41.3 percent in CNOT gate counts, 37.9 percent in circuit depth, and 42.6 percent in circuit duration for various molecules of different sizes and structures compared with state-of-the-art approaches. Tetris is open-sourced at this link.
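As context for the two-qubit-gate objective (this is not Tetris itself), the sketch below uses Qiskit to compile the exponential of a single Pauli-string term and count the resulting CX gates, the quantity the paper aims to reduce; the Hamiltonian term and basis gates are arbitrary examples.

```python
# Illustration of the quantity Tetris minimises (not Tetris itself): compile
# the exponential of one Pauli-string term with Qiskit and count CX gates.
# Requires `qiskit`; the Hamiltonian term and basis gates are arbitrary.
from qiskit import QuantumCircuit, transpile
from qiskit.circuit.library import PauliEvolutionGate
from qiskit.quantum_info import SparsePauliOp

term = SparsePauliOp(["XXYZ"], coeffs=[0.1])   # one term of a toy Hamiltonian
circuit = QuantumCircuit(4)
circuit.append(PauliEvolutionGate(term, time=1.0), range(4))

compiled = transpile(circuit, basis_gates=["cx", "rz", "sx", "x"], optimization_level=3)
ops = compiled.count_ops()
print("CX (two-qubit) gates:", ops.get("cx", 0))
print("circuit depth:", compiled.depth())
```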
https://arxiv.org/abs/2309.01905
In this paper, we introduce a simulacrum of a hospital called Agent Hospital that simulates the entire process of treating illness. All patients, nurses, and doctors are autonomous agents powered by large language models (LLMs). Our central goal is to enable a doctor agent to learn how to treat illness within the simulacrum. To do so, we propose a method called MedAgent-Zero. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. Simulation experiments show that the treatment performance of doctor agents consistently improves on various tasks. More interestingly, the knowledge the doctor agents acquire in Agent Hospital is applicable to real-world medical benchmarks. After treating around ten thousand patients (which may take real-world doctors over two years), the evolved doctor agent achieves a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset covering major respiratory diseases. This work paves the way for advancing the applications of LLM-powered agent techniques in medical scenarios.
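The experience-accumulation loop can be caricatured as below: a stubbed doctor agent stores validated treatments from simulated cases and reuses the closest past case as guidance. The cases, the string-similarity recall, and the agent stub are entirely synthetic illustrations of the idea rather than MedAgent-Zero itself.

```python
# Caricature of experience accumulation: a stubbed doctor agent stores
# validated treatments from simulated cases and reuses the closest past case.
# Cases, the string-similarity recall, and the agent stub are all synthetic.
from difflib import SequenceMatcher
from typing import List, Tuple

experience: List[Tuple[str, str]] = []   # (case description, validated treatment)

simulated_cases = [
    ("persistent cough, fever, crackles on auscultation", "community-acquired pneumonia: antibiotics"),
    ("wheezing and dyspnea triggered by exercise", "asthma: inhaled bronchodilator"),
    ("cough, fever and crackles in the left lower lobe", "community-acquired pneumonia: antibiotics"),
]

def recall_similar(case: str) -> str:
    """Retrieve the treatment of the most similar past case, if any."""
    if not experience:
        return ""
    best = max(experience, key=lambda e: SequenceMatcher(None, e[0], case).ratio())
    return best[1]

def agent_answer(case: str, hint: str) -> str:
    """Stub for the LLM doctor agent; here it simply follows a recalled hint."""
    return hint or "order further tests"

for case, ground_truth in simulated_cases:
    answer = agent_answer(case, recall_similar(case))
    if answer == ground_truth:                      # validated by the simulation
        print(f"correct (reused experience): {case!r}")
    else:
        experience.append((case, ground_truth))     # learn from the failed case
        print(f"learned new experience from: {case!r}")
```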
https://arxiv.org/abs/2405.02957
The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected not only to rely on the semantic relevance between the documents and the queries but also to recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expected to identify evidence from both supporting and contradicting perspectives, so that the downstream system can make a fair judgment call. In this work, we study whether retrievers can recognize and respond to different perspectives of the queries -- beyond finding relevant documents for a claim, can retrievers distinguish supporting vs. opposing documents? We reformulate and extend six existing tasks to create a benchmark for retrieval in which diverse perspectives are described in free-form text alongside the root, neutral queries. We show that the current retrievers covered in our experiments have limited awareness of subtly different perspectives in queries and can also be biased toward certain perspectives. Motivated by this observation, we further explore the potential of leveraging geometric features of the retriever representation space to improve the perspective awareness of retrievers in a zero-shot manner. We demonstrate the efficiency and effectiveness of our projection-based methods on the same set of tasks. Further analysis also shows how perspective awareness improves performance on various downstream tasks, with 4.2% higher accuracy on AmbigQA and 29.9% more correlation with designated viewpoints in essay writing, compared to non-perspective-aware baselines.
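A geometric sketch of a zero-shot, projection-style adjustment is given below: the query embedding is shifted along a direction separating a perspective anchor phrase from a neutral one before scoring documents. The random `embed` function stands in for a real dense retriever, and the anchors and alpha are illustrative rather than the paper's exact projection method.

```python
# Zero-shot, projection-style perspective adjustment sketch: shift the query
# embedding along a direction separating a perspective anchor from a neutral
# anchor before scoring documents. `embed` is a random stand-in for a dense
# retriever; anchors and alpha are illustrative, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
_cache = {}

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Deterministic random embedding standing in for a dense retriever."""
    if text not in _cache:
        v = rng.normal(size=dim)
        _cache[text] = v / np.linalg.norm(v)
    return _cache[text]

def perspective_query(query: str, perspective: str, alpha: float = 0.5) -> np.ndarray:
    direction = embed(f"evidence {perspective} the claim") - embed("evidence about the claim")
    shifted = embed(query) + alpha * direction
    return shifted / np.linalg.norm(shifted)

docs = ["Study A reports the vaccine reduced infections.",
        "Study B found no measurable effect of the vaccine."]
doc_vecs = np.stack([embed(d) for d in docs])

query = "Does the vaccine reduce infections?"
for perspective in ("supporting", "opposing"):
    scores = doc_vecs @ perspective_query(query, perspective)
    top = docs[int(np.argmax(scores))]
    print(f"{perspective}: top-ranked document -> {top!r}")
```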
https://arxiv.org/abs/2405.02714
The rapid advancement in artificial intelligence and natural language processing has led to the development of large-scale datasets aimed at benchmarking the performance of machine learning models. Herein, we introduce 'RetChemQA,' a comprehensive benchmark dataset designed to evaluate the capabilities of such models in the domain of reticular chemistry. This dataset includes both single-hop and multi-hop question-answer pairs, encompassing approximately 45,000 Q&As for each type. The questions have been extracted from an extensive corpus of literature containing about 2,530 research papers from publishers including NAS, ACS, RSC, Elsevier, and Nature Publishing Group, among others. The dataset has been generated using OpenAI's GPT-4 Turbo, a cutting-edge model known for its exceptional language understanding and generation capabilities. In addition to the Q&A dataset, we also release a dataset of synthesis conditions extracted from the corpus of literature used in this study. The aim of RetChemQA is to provide a robust platform for the development and evaluation of advanced machine learning algorithms, particularly for the reticular chemistry community. The dataset is structured to reflect the complexities and nuances of real-world scientific discourse, thereby enabling nuanced performance assessments across a variety of tasks. The dataset is available at the following link: this https URL
https://arxiv.org/abs/2405.02128
This research paper presents a comprehensive analysis of integrating advanced language models with search and retrieval systems in the fields of information retrieval and natural language processing. The objective is to evaluate and compare various state-of-the-art methods based on their performance in terms of accuracy and efficiency. The analysis explores different combinations of technologies, including Azure Cognitive Search Retriever with GPT-4, Pinecone's Canopy framework, Langchain with Pinecone and different language models (OpenAI, Cohere), LlamaIndex with Weaviate Vector Store's hybrid search, Google's RAG implementation on Cloud VertexAI-Search, Amazon SageMaker's RAG, and a novel approach called KG-FID Retrieval. The motivation for this analysis arises from the increasing demand for robust and responsive question-answering systems in various domains. The RobustQA metric is used to evaluate the performance of these systems under diverse paraphrasing of questions. The report aims to provide insights into the strengths and weaknesses of each method, facilitating informed decisions in the deployment and development of AI-driven search and retrieval systems.
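A RobustQA-style robustness check can be sketched as below: a placeholder retrieval-augmented pipeline is queried with several paraphrases of the same question and scored on how consistently it returns the reference answer; the questions, answers, and `rag_answer` stub are invented.

```python
# RobustQA-style consistency check: query a (placeholder) RAG pipeline with
# several paraphrases of the same question and measure how often the
# reference answer is returned. Questions, answers, and the stub are invented.
from typing import List

def rag_answer(question: str) -> str:
    """Placeholder for any of the retrieval-augmented pipelines compared above."""
    return "1789" if "french revolution" in question.lower() else "unknown"

def robustness_accuracy(paraphrases: List[str], reference: str) -> float:
    correct = sum(rag_answer(q).strip().lower() == reference.lower() for q in paraphrases)
    return correct / len(paraphrases)

paraphrases = [
    "When did the French Revolution begin?",
    "In which year did the French Revolution start?",
    "What year marks the start of the French Revolution?",
]
print("accuracy under paraphrasing:", robustness_accuracy(paraphrases, "1789"))
```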
https://arxiv.org/abs/2405.02048
The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents that leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging, since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset that challenges the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
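The sparse-query compression step can be sketched as below: a small set of learnable queries cross-attends to dense multi-view image features and the result is projected into the LLM's embedding width. Sizes are arbitrary, and the positional encoding and 3D lifting details of the actual architecture are omitted.

```python
# Sparse-query compression sketch: learnable queries cross-attend to dense
# multi-view image features and are projected into the LLM embedding width.
# Sizes are arbitrary; positional encoding and 3D lifting details are omitted.
import torch
import torch.nn as nn

batch, n_views, hw, dim = 2, 6, 32 * 32, 256
n_queries = 64                                   # sparse queries << image tokens

image_feats = torch.randn(batch, n_views * hw, dim)       # flattened multi-camera features
queries = nn.Parameter(torch.randn(n_queries, dim))

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
to_llm = nn.Linear(dim, 4096)                    # project to a (hypothetical) LLM width

q = queries.unsqueeze(0).expand(batch, -1, -1)             # (batch, n_queries, dim)
compressed, _ = cross_attn(q, image_feats, image_feats)    # (batch, n_queries, dim)
llm_tokens = to_llm(compressed)                            # (batch, n_queries, 4096)

print("condensed world-model tokens for the LLM:", tuple(llm_tokens.shape))
```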
https://arxiv.org/abs/2405.01533
Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
https://arxiv.org/abs/2405.01474