We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL.
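As a concrete illustration of length conditioning at the data level, the following minimal Python sketch turns region captions of varying lengths into length-conditioned training pairs. The `<box=...>` and `<len=...>` prefix tokens and the word-count notion of length are assumptions made for illustration, not FlexCap's actual tokenization or training format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RegionCaption:
    box: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)
    caption: str

def build_length_conditioned_examples(regions: List[RegionCaption]) -> List[dict]:
    """Turn region captions of varying lengths into length-conditioned training pairs.

    The conditioning prefix carries the bounding box and the desired word count,
    so the same model can later be steered from terse labels ("dog") to detailed
    sentences simply by changing the length token.
    """
    examples = []
    for r in regions:
        n_words = len(r.caption.split())
        prefix = f"<box={','.join(f'{c:.3f}' for c in r.box)}> <len={n_words}>"
        examples.append({"prefix": prefix, "target": r.caption})
    return examples

if __name__ == "__main__":
    data = [
        RegionCaption((0.10, 0.20, 0.45, 0.80), "dog"),
        RegionCaption((0.10, 0.20, 0.45, 0.80), "a brown dog catching a red frisbee"),
    ]
    for ex in build_length_conditioned_examples(data):
        print(ex["prefix"], "->", ex["target"])
```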
https://arxiv.org/abs/2403.12026
Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all.
https://arxiv.org/abs/2403.12025
With the rapid development of generative models, Artificial Intelligence-Generated Content (AIGC) has exponentially increased in daily life. Among them, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating high perceptual quality videos, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from text-video alignment and video fidelity perspectives, and then leverages a large language model to produce the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjectively aligned predictions, validating its effectiveness. The dataset and code will be released at this https URL.
https://arxiv.org/abs/2403.11956
Recent chatbots have demonstrated impressive ability to understand and communicate in raw-text form. However, there is more to the world than raw text. For example, humans spend long hours of their time on web pages, where text is intertwined with other modalities and tasks are accomplished in the form of various complex interactions. Can state-of-the-art multi-modal models generalize to such complex domains? To address this question, we introduce TurkingBench, a benchmark of tasks formulated as web pages containing textual instructions with multi-modal context. Unlike existing work, which employs artificially synthesized web pages, here we use natural HTML pages that were originally designed for crowdsourcing workers for various annotation purposes. The HTML instructions of each task are also instantiated with various values (obtained from the crowdsourcing tasks) to form new instances of the task. This benchmark contains 32.2K instances distributed across 158 tasks. Additionally, to facilitate evaluation on TurkingBench, we develop an evaluation framework that connects the responses of chatbots to modifications on web pages (modifying a text box, selecting a radio button, etc.). We evaluate the performance of state-of-the-art models, including language-only, vision-only, and layout-only models, and their combinations, on this benchmark. Our findings reveal that these models perform significantly better than random chance, yet considerable room exists for improvement. We hope this benchmark will help facilitate the evaluation and development of web-based agents.
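To make the evaluation framework concrete, here is a hedged Python sketch of how chatbot responses could be scored as web-page modifications. It assumes the model output has already been parsed into a mapping from hypothetical HTML element ids to target states, compared field-by-field against gold annotations; TurkingBench's actual response format and metrics may differ.

```python
from typing import Dict, Union

FieldValue = Union[str, bool]

def score_form_modifications(predicted: Dict[str, FieldValue],
                             gold: Dict[str, FieldValue]) -> float:
    """Exact-match accuracy over the form fields touched by the gold annotation.

    Keys are HTML element ids (text boxes, radio groups, checkboxes); values are
    the states the agent should leave them in.
    """
    if not gold:
        return 1.0
    correct = sum(1 for field, value in gold.items() if predicted.get(field) == value)
    return correct / len(gold)

if __name__ == "__main__":
    gold = {"q1_answer": "positive", "q2_checkbox": True}
    predicted = {"q1_answer": "positive", "q2_checkbox": False}
    print(score_form_modifications(predicted, gold))  # 0.5
```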
https://arxiv.org/abs/2403.11905
Employing Large Language Models (LLMs) for semantic parsing has achieved remarkable success. However, we find that existing methods fall short in terms of reliability and efficiency when hallucinations are encountered. In this paper, we address these challenges with a framework called QueryAgent, which solves a question step-by-step and performs step-wise self-correction. We introduce an environmental feedback-based self-correction method called ERASER. Unlike traditional approaches, ERASER leverages rich environmental feedback in the intermediate steps to perform selective and differentiated self-correction only when necessary. Experimental results demonstrate that QueryAgent, using only one example, notably outperforms all previous few-shot methods on GrailQA and GraphQ by 7.0 and 15.0 F1 points, respectively. Moreover, our approach exhibits superiority in terms of efficiency, including runtime, query overhead, and API invocation costs. By leveraging ERASER, we further improve another baseline (i.e., AgentBench) by approximately 10 points, revealing the strong transferability of our approach.
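A minimal sketch of the step-wise, feedback-gated correction loop described here, with `generate_step` and `execute_step` as hypothetical stand-ins for the LLM call and the KB execution environment; correction is attempted only when the environment reports a problem, which is the selective behavior attributed to ERASER. Details such as the step format are invented for illustration.

```python
from typing import Callable, List, Tuple

def solve_stepwise(question: str,
                   generate_step: Callable[[str, List[str]], str],
                   execute_step: Callable[[str], Tuple[bool, str]],
                   max_steps: int = 8,
                   max_retries: int = 2) -> List[str]:
    """Solve a question step by step, self-correcting only on environmental errors.

    generate_step(question, history) -> next action (e.g., a partial query);
    execute_step(action) -> (ok, feedback) from the environment (e.g., a KB engine).
    """
    history: List[str] = []
    for _ in range(max_steps):
        step = generate_step(question, history)
        ok, feedback = execute_step(step)
        retries = 0
        # Selective correction: regenerate only when the environment complains,
        # feeding the feedback back into the prompt.
        while not ok and retries < max_retries:
            step = generate_step(question, history + [f"feedback: {feedback}"])
            ok, feedback = execute_step(step)
            retries += 1
        history.append(step)
        if step.strip().upper().startswith("FINISH"):  # assumed termination marker
            break
    return history
```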
https://arxiv.org/abs/2403.11886
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times higher resolution (i.e., 672x1088) using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at this https URL.
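A rough sketch of the image-modularization idea, assuming a ViT-style encoder with a 336x336 native input and a cap on the number of slices; the grid-selection heuristic below is illustrative and not LLaVA-UHD's exact rule.

```python
import math
from typing import List, Tuple

from PIL import Image

def choose_grid(width: int, height: int, patch: int = 336, max_slices: int = 6) -> Tuple[int, int]:
    """Pick a (cols, rows) grid whose cells match the image aspect ratio while
    keeping the slice count close to the image area divided by the patch area."""
    ideal = (width * height) / (patch * patch)
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_slices + 1):
        for rows in range(1, max_slices + 1):
            if cols * rows > max_slices:
                continue
            # Penalize deviation from the ideal slice count and aspect-ratio distortion.
            err = abs(cols * rows - ideal) + abs(math.log((width / cols) / (height / rows)))
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def slice_image(img: Image.Image, patch: int = 336, max_slices: int = 6) -> List[Image.Image]:
    """Divide a native-resolution image into variable-sized slices, each resized
    to the encoder's native input size."""
    cols, rows = choose_grid(img.width, img.height, patch, max_slices)
    w, h = img.width // cols, img.height // rows
    return [img.crop((c * w, r * h, (c + 1) * w, (r + 1) * h)).resize((patch, patch))
            for r in range(rows) for c in range(cols)]
```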
https://arxiv.org/abs/2403.11703
We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment which is used by the LLM for instruction synthesis. We implement our approach on multiple simulation platforms including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We subjectively evaluate our approach via a user study and observe that 83.3% of users find the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions, and observe very close correlation with the baseline on standard success metrics (< 1% change in SR), quantifying the viability of generated instructions in replacing human-annotated data. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner, without requiring any form of training.
https://arxiv.org/abs/2403.11487
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performances on several long-horizon video understanding benchmarks, an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
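A hedged sketch of what such a structured memory could look like as plain Python data structures: generic temporal event descriptions plus object-centric tracks, with a naive keyword lookup standing in for the learned segment-localization and object-query tools the agent actually invokes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TemporalEvent:
    start: float        # segment start, in seconds
    end: float          # segment end, in seconds
    description: str    # caption of what happens in the segment

@dataclass
class ObjectTrack:
    object_id: int
    category: str
    # frame timestamp -> bounding box (x1, y1, x2, y2)
    boxes: Dict[float, Tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class VideoMemory:
    events: List[TemporalEvent] = field(default_factory=list)
    tracks: List[ObjectTrack] = field(default_factory=list)

    def localize(self, keyword: str) -> List[TemporalEvent]:
        """Keyword lookup as a placeholder for embedding-based segment retrieval."""
        return [e for e in self.events if keyword.lower() in e.description.lower()]

    def query_objects(self, category: str) -> List[ObjectTrack]:
        """Return the tracking states of all objects of a given category."""
        return [t for t in self.tracks if t.category == category]
```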
https://arxiv.org/abs/2403.11481
The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the quality score of an input image without additional information. NR-IQA models play a crucial role in the media industry, aiding in performance evaluation and optimization guidance. However, these models are found to be vulnerable to adversarial attacks, which introduce imperceptible perturbations to input images, resulting in significant changes in predicted scores. In this paper, we propose a defense method to improve the stability in predicted scores when attacked by small perturbations, thus enhancing the adversarial robustness of NR-IQA models. To be specific, we present theoretical evidence showing that the magnitude of score changes is related to the $\ell_1$ norm of the model's gradient with respect to the input image. Building upon this theoretical foundation, we propose a norm regularization training strategy aimed at reducing the $\ell_1$ norm of the gradient, thereby boosting the robustness of NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge, this work marks the first attempt to defend against adversarial attacks on NR-IQA models. Our study offers valuable insights into the adversarial robustness of NR-IQA models and provides a foundation for future research in this area.
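The regularizer itself is simple to state in code. Below is a minimal PyTorch sketch, assuming a score-regression model trained against MOS labels, that adds a penalty on the mean absolute input gradient (a scaled $\ell_1$ norm) via a double-backward pass; the weighting and exact loss form are illustrative rather than the paper's precise recipe.

```python
import torch
import torch.nn.functional as F

def norm_regularized_loss(model: torch.nn.Module,
                          images: torch.Tensor,   # (B, C, H, W)
                          mos: torch.Tensor,      # (B,) mean opinion scores
                          lam: float = 0.01) -> torch.Tensor:
    """Quality-regression loss plus an l1-style penalty on the input gradient.

    Shrinking ||d f / d x||_1 bounds how much a small perturbation of the input
    can move the predicted score, which is the robustness argument above.
    """
    images = images.clone().requires_grad_(True)
    scores = model(images).squeeze(-1)
    task_loss = F.mse_loss(scores, mos)
    grads = torch.autograd.grad(scores.sum(), images, create_graph=True)[0]
    grad_penalty = grads.abs().mean()
    return task_loss + lam * grad_penalty
```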
https://arxiv.org/abs/2403.11397
Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches, but overlooks an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the in-context examples are selected determines which is better.
https://arxiv.org/abs/2403.11317
The integration of Multimodal Large Language Models (MLLMs) with robotic systems has significantly enhanced the ability of robots to interpret and act upon natural language instructions. Despite these advancements, conventional MLLMs are typically trained on generic image-text pairs, lacking essential robotics knowledge such as affordances and physical knowledge, which hampers their efficacy in manipulation tasks. To bridge this gap, we introduce ManipVQA, a novel framework designed to endow MLLMs with Manipulation-centric knowledge through a Visual Question-Answering format. This approach not only encompasses tool detection and affordance recognition but also extends to a comprehensive understanding of physical concepts. Our approach starts with collecting a varied set of images displaying interactive objects, which presents a broad range of challenges in tool object detection, affordance, and physical concept predictions. To seamlessly integrate this robotic-specific knowledge with the inherent vision-reasoning capabilities of MLLMs, we adopt a unified VQA format and devise a fine-tuning strategy that preserves the original vision-reasoning abilities while incorporating the new robotic insights. Empirical evaluations conducted in robotic simulators and across various vision task benchmarks demonstrate the robust performance of ManipVQA. Code and dataset will be made publicly available at this https URL.
https://arxiv.org/abs/2403.11289
No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on annotated Mean Opinion Scores (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require labeled MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the inherent quality of the images. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts, while guaranteeing consistent representations for images with comparable quality. Our method achieves state-of-the-art performance on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms supervised methods when their training dataset differs from the testing one, thus proving to be more suitable for real-world scenarios. Furthermore, our approach demonstrates greater robustness and improved explainability than competing methods. The code and the model are publicly available at this https URL.
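A minimal PyTorch sketch of the antonym-prompt ranking objective, assuming precomputed CLIP embeddings for one image at increasing degradation levels and for a positive/negative prompt pair; the margin value and the exact composition of the loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def antonym_ranking_loss(image_feats: torch.Tensor,  # (L, D): one image at L increasing degradation levels
                         good_text: torch.Tensor,    # (D,) embedding of e.g. "Good photo."
                         bad_text: torch.Tensor,     # (D,) embedding of e.g. "Bad photo."
                         margin: float = 0.05) -> torch.Tensor:
    """Rank degraded views of an image: similarity to the positive prompt should
    fall, and similarity to the negative prompt should rise, with degradation."""
    img = F.normalize(image_feats, dim=-1)
    s_good = img @ F.normalize(good_text, dim=-1)  # (L,)
    s_bad = img @ F.normalize(bad_text, dim=-1)    # (L,)
    loss = image_feats.new_zeros(())
    for i in range(len(s_good) - 1):
        # level i is less degraded than level i + 1
        loss = loss + F.relu(margin - (s_good[i] - s_good[i + 1]))
        loss = loss + F.relu(margin - (s_bad[i + 1] - s_bad[i]))
    return loss
```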
https://arxiv.org/abs/2403.11176
Quantization-aware training (QAT) and Knowledge Distillation (KD) are combined to achieve competitive performance in creating low-bit deep learning models. However, existing works applying KD to QAT require tedious hyper-parameter tuning to balance the weights of different loss terms, assume the availability of labeled training data, and require complex, computationally intensive training procedures for good performance. To address these limitations, this paper proposes a novel Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD) framework. SQAKD first unifies the forward and backward dynamics of various quantization functions, making it flexible for incorporating various QAT works. Then it formulates QAT as a co-optimization problem that simultaneously minimizes the KL-Loss between the full-precision and low-bit models for KD and the discretization error for quantization, without supervision from labels. A comprehensive evaluation shows that SQAKD substantially outperforms the state-of-the-art QAT and KD works for a variety of model architectures. Our code is at: this https URL.
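The label-free co-optimization can be summarized as a two-term loss: distill the full-precision teacher into the low-bit student and penalize the quantizer's discretization error. A hedged PyTorch sketch follows; the temperature, the weighting `beta`, and the mean-squared form of the discretization term are illustrative choices rather than SQAKD's exact formulation.

```python
import torch
import torch.nn.functional as F

def sqakd_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               fp_weights: torch.Tensor,    # full-precision weights
               q_weights: torch.Tensor,     # their quantized counterparts
               temperature: float = 4.0,
               beta: float = 1.0) -> torch.Tensor:
    """KL distillation from the full-precision model plus discretization error,
    with no ground-truth labels involved."""
    t = temperature
    kd = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(teacher_logits / t, dim=-1),
                  reduction="batchmean") * (t * t)
    disc = F.mse_loss(q_weights, fp_weights)  # quantization (discretization) error
    return kd + beta * disc
```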
https://arxiv.org/abs/2403.11106
The advent of ChatGPT has sparked over a year of regulatory frenzy. However, few existing studies have rigorously questioned the assumption that, if left unregulated, AI chatbot's output would inflict tangible, severe real harm on human affairs. Most researchers have overlooked the critical possibility that the information market itself can effectively mitigate these risks and, as a result, they tend to use regulatory tools to address the issue directly. This Article develops a yardstick for reevaluating both AI-related content risks and corresponding regulatory proposals by focusing on inter-informational competition among various outlets. The decades-long history of regulating information and communications technologies indicates that regulators tend to err too much on the side of caution and to put forward excessive regulatory measures when encountering the uncertainties brought about by new technologies. In fact, a trove of empirical evidence has demonstrated that market competition among information outlets can effectively mitigate most risks and that overreliance on regulation is not only unnecessary but detrimental, as well. This Article argues that sufficient competition among chatbots and other information outlets in the information marketplace can sufficiently mitigate and even resolve most content risks posed by generative AI technologies. This renders certain loudly advocated regulatory strategies, like mandatory prohibitions, licensure, curation of datasets, and notice-and-response regimes, truly unnecessary and even toxic to desirable competition and innovation throughout the AI industry. Ultimately, the ideas that I advance in this Article should pour some much-needed cold water on the regulatory frenzy over generative AI and steer the issue back to a rational track.
https://arxiv.org/abs/2403.11046
In this study, we introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh. Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions. We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English. We also investigate some prompting methods, and find that Chain-of-Thought prompting is beneficial mostly on reasoning questions, but not so much on factual ones. We also find that appending English translation helps to answer questions in Bengali. Our findings point to promising future research directions for improving the performance of LLMs in Bengali and more generally in low-resource languages.
https://arxiv.org/abs/2403.10900
While Multimodal Large Language Models (MLLMs) have experienced significant advancement on visual understanding and reasoning, their potential to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one closed-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) in both full-reference and no-reference scenarios. Experimental results show that only the closed-source GPT-4V provides a reasonable account for human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly.
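To show how the nine prompting systems arise as combinations, here is a small sketch of a prompt builder crossing the three psychophysical procedures with the three prompting strategies; the wording of the instructions is invented for illustration and is not the paper's prompt text.

```python
from typing import Sequence

PROCEDURES = ("single-stimulus", "double-stimulus", "multiple-stimulus")
STRATEGIES = ("standard", "in-context", "chain-of-thought")

def build_iqa_prompt(procedure: str, strategy: str,
                     n_images: int = 1, examples: Sequence[str] = ()) -> str:
    """Compose one of the 3 x 3 = 9 prompting systems as a text instruction to
    accompany the image(s) shown to the MLLM."""
    assert procedure in PROCEDURES and strategy in STRATEGIES
    if procedure == "single-stimulus":
        task = "Rate the quality of the image on a scale from 1 (bad) to 5 (excellent)."
    elif procedure == "double-stimulus":
        task = "Which of the two images has better quality? Answer 'first' or 'second'."
    else:
        task = f"Rank the {n_images} images from best to worst visual quality."
    parts = list(examples) if strategy == "in-context" else []  # worked examples first
    parts.append(task)
    if strategy == "chain-of-thought":
        parts.append("Describe the distortions you observe step by step before giving the final answer.")
    return "\n".join(parts)
```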
https://arxiv.org/abs/2403.10854
State-of-the-art KBQA models assume answerability of questions. Recent research has shown that while these can be adapted to detect unanswerability with suitable training and thresholding, this comes at the expense of accuracy for answerable questions, and no single model is able to handle all categories of unanswerability. We propose a new model for KBQA named RetinaQA that is robust against unanswerability. It complements KB-traversal based logical form retrieval with sketch-filling based logical form construction. This helps with questions that have valid logical forms but no data paths in the KB leading to an answer. Additionally, it uses discrimination instead of generation to better identify questions that do not have valid logical forms. We demonstrate that RetinaQA significantly outperforms adaptations of state-of-the-art KBQA models across answerable and unanswerable questions, while showing robustness across unanswerability categories. Remarkably, it also establishes a new state of the art for answerable KBQA by surpassing existing models.
https://arxiv.org/abs/2403.10849
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
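A hedged sketch of the iterative loop, with `caption_frame` and `llm_decide` as hypothetical stand-ins for the vision-language tools and the central LLM; the decision format (either an answer or a request for additional frames) is an assumption made for illustration.

```python
from typing import Callable, List

def answer_with_iterative_frames(question: str,
                                 num_frames: int,
                                 caption_frame: Callable[[int], str],
                                 llm_decide: Callable[[str, List[str]], dict],
                                 max_rounds: int = 3) -> str:
    """Start from a sparse uniform frame sample and let the LLM either answer or
    request more frames where the gathered information is insufficient."""
    frame_ids = list(range(0, num_frames, max(1, num_frames // 5)))[:5]
    captions = [f"frame {i}: {caption_frame(i)}" for i in frame_ids]
    for _ in range(max_rounds):
        decision = llm_decide(question, captions)  # {"answer": str | None, "request": [frame ids]}
        if decision.get("answer"):
            return decision["answer"]
        for i in decision.get("request", []):
            if i not in frame_ids and 0 <= i < num_frames:
                frame_ids.append(i)
                captions.append(f"frame {i}: {caption_frame(i)}")
    return llm_decide(question, captions).get("answer") or "unanswerable"
```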
https://arxiv.org/abs/2403.10517
Mitigating hallucinations of Large Multi-modal Models (LMMs) is crucial to enhance their reliability for general-purpose assistants. This paper shows that such hallucinations of LMMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LMMs. On our benchmark, the zero-shot performance of state-of-the-art LMMs dropped significantly for both the VQA and Captioning tasks. Next, we further reveal this hallucination is mainly due to the prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning that robustly fine-tunes LMMs on augmented multi-modal instruction-following datasets with hallucinatory dialogues. Extensive experiments show that our proposed approach successfully reduces dialogue hallucination while maintaining or even improving performance.
https://arxiv.org/abs/2403.10492
Performance attribution analysis, defined as the process of explaining the drivers of the excess performance of an investment portfolio against a benchmark, stands as a significant aspect of portfolio management and plays a crucial role in the investment decision-making process, particularly within the fund management industry. Rooted in a solid financial and mathematical framework, the importance and methodologies of this analytical technique are extensively documented across numerous academic research papers and books. The integration of large language models (LLMs) and AI agents marks a groundbreaking development in this field. These agents are designed to automate and enhance the performance attribution analysis by accurately calculating and analyzing portfolio performances against benchmarks. In this study, we introduce the application of an AI agent for a variety of essential performance attribution tasks, including the analysis of performance drivers and utilizing LLMs as a calculation engine for multi-level attribution analysis and question-answer (QA) exercises. Leveraging advanced prompt engineering techniques such as Chain-of-Thought (CoT) and Plan and Solve (PS), and employing a standard agent framework from LangChain, the research achieves promising results: accuracy rates exceeding 93% in analyzing performance drivers, 100% in multi-level attribution calculations, and over 84% in QA exercises that simulate official examination standards. These findings affirm the impactful role of AI agents, prompt engineering and evaluation in advancing portfolio management processes, highlighting a significant advancement in the practical application and evaluation of AI technologies within the domain.
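The abstract does not spell out the attribution arithmetic, so as context, here is a standard single-period Brinson-Fachler calculation of the kind a multi-level attribution engine would automate; the sector weights and returns are made up for the worked example.

```python
from typing import Dict, Tuple

def brinson_attribution(portfolio: Dict[str, Tuple[float, float]],
                        benchmark: Dict[str, Tuple[float, float]]) -> Dict[str, Dict[str, float]]:
    """Single-period Brinson-Fachler attribution per sector.

    Each input maps sector -> (weight, return). The allocation, selection, and
    interaction effects sum across sectors to the portfolio's active return.
    """
    bench_return = sum(w * r for w, r in benchmark.values())
    effects = {}
    for sector, (wb, rb) in benchmark.items():
        wp, rp = portfolio.get(sector, (0.0, 0.0))
        effects[sector] = {
            "allocation": (wp - wb) * (rb - bench_return),
            "selection": wb * (rp - rb),
            "interaction": (wp - wb) * (rp - rb),
        }
    return effects

if __name__ == "__main__":
    portfolio = {"Tech": (0.6, 0.10), "Energy": (0.4, 0.02)}
    benchmark = {"Tech": (0.5, 0.08), "Energy": (0.5, 0.03)}
    effects = brinson_attribution(portfolio, benchmark)
    active = sum(sum(e.values()) for e in effects.values())
    print(effects)
    print(f"active return: {active:.4f}")  # 0.068 - 0.055 = 0.0130
```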
https://arxiv.org/abs/2403.10482