Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as reference for the new SVBRDF material generation according to the original diffuse map, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.
物理真实感材料在增强3D资产的各种应用和光照条件下的真实感方面至关重要。然而,现有的3D资产和生成模型通常缺乏真实材料的属性。使用图形软件手动分配材料是一个费力且耗时的任务。在本文中,我们利用多模态大型语言模型(MMLMs)的进步,特别是GPT-4V,提出了一个新方法,名为Make-it-Real:1)我们证明了GPT-4V可以有效地识别和描述材料,使得构建详细材料库成为可能。2)利用视觉提示和分层文本提示,GPT-4V准确地识别和校准材料与3D物体相应部件的对应关系。3)然后,正确匹配的材料被用作根据原始漫射图生成新的SVBRDF材料的新参考,显著增强了它们的视觉真实感。Make-it-Real使3D内容创建工作流程更加流畅,展示了它在开发者3D资产方面作为关键工具的重要作用。
https://arxiv.org/abs/2404.16829
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at this https URL.
在这份报告中,我们介绍了InternVL 1.5,一个开源的多模态大型语言模型(MLLM),以弥合开源和商业模型在多模态理解能力方面的差距。我们介绍了三个简单的改进:(1)强视图编码器:我们对 large-scale vision foundation model -- InternViT-6B 进行连续学习,提高了其视觉理解能力,并使其可以迁移和重用于不同的LLM。 (2)动态高分辨率:我们根据输入图像的透视率和分辨率将图像划分为从1到40个448x448像素的方块,支持最高4K分辨率输入。 (3)高质量双语数据集:我们仔细收集了一个高质量的双语数据集,涵盖了常见的场景、文档图像,并使用英语和中文问题与答案对它们进行了标注,显著提高了 OCR- 和与中文相关的任务的表现。我们通过一系列基准测试和比较研究评估了InternVL 1.5。与开源和商业模型相比,InternVL 1.5显示出具有竞争力的性能,在8个基准测试中实现了最先进的结果。代码已发布在https://这个网址。
https://arxiv.org/abs/2404.16821
As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 Billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench - the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set 29 of Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM and LLaMA on IndicGenBench in a variety of settings. The largest PaLM-2 models performs the best on most tasks, however, there is a significant performance gap in all languages compared to English showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at this http URL
随着大型语言模型(LLMs)在全球范围内的应用不断增加,LLMs代表世界语言多样性至关重要。印度是一个拥有14亿人口的多语言国家。为了促进对多语言LLM评估的研究,我们发布了IndicGenBench - 针对13个脚本和4个语言家庭的多语言用户生成任务评估的最大基准。IndicGenBench由跨语言摘要、机器翻译和跨语言问题回答等多样化的生成任务组成。通过人类审核,IndicGenBench为许多印度语言提供了多途径并行评估数据,为许多代表性不足的印度语言首次提供了全面评估。我们在IndicGenBench上评估了各种专有和开源LLM,包括GPT-3.5、GPT-4、PaLM-2、mT5、Gemma、BLOOM和LLLaMA。在IndicGenBench上,最大的PaLM-2模型在大多数任务上表现最佳,然而,与英语相比,所有语言之间的性能差距都很大,这表明需要进一步研究更包容的多语言语言模型的开发。IndicGenBench发布在以下URL:
https://arxiv.org/abs/2404.16816
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: this https URL.
虽然许多当代大型语言模型(LLMs)可以处理长输入,但它们仍然很难在长上下文中完全利用信息,这被称为迷失在中间的挑战。我们假设这源于在长上下文训练期间缺乏明确的监督,这没有强调任何长上下文中的位置都可能持有关键信息。根据这个直觉,我们的研究提出了信息密集型(IN2)训练,这是一种完全数据驱动的解决方案来克服迷失在中间的挑战。具体来说,IN2训练利用合成长上下文问题-答案数据集,其中答案需要(1)在合成长上下文(4K-32K个词)中的短片段(~128个词)进行精细信息意识,以及(2)来自两个或更多短片段的信息整合和推理。通过在Mistral-7B上应用这一信息密集型训练,我们提出了FILM-7B(FILM-在中间)。为了全面评估FILM-7B在利用长上下文方面的能力,我们设计了一个涵盖各种上下文风格(文档、代码和结构化数据)和信息检索模式(前向、后向和双向检索)的三个探针任务。探针结果表明,FILM-7B可以稳健地从其32K个上下文窗口中的不同位置检索信息。除了这些探针任务之外,FILM-7B在现实世界的长上下文任务中的性能显著提高(例如,在NarrativeQA上的23.5->26.9 F1得分),同时它在短上下文任务中的性能与预相当(例如,在MMLU上的59.3->59.2准确率)。Github链接:https://github.com/。
https://arxiv.org/abs/2404.16811
Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge, while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing the generation quality across various tasks through in-context learning (ICL) using given examples without the need for any fine-tuning. However, the diversity aspect in LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies the LLM generations, while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between the quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.
生成常识推理(GCR)需要一个模型使用常识知识来推理关于一种情况的句子,同时生成连贯的句子。尽管生成的句子的质量至关重要,但生成多样性同样重要,因为它反映了模型能够使用一系列常识知识事实的能力。大型语言模型(LLMs)通过在上下文中学来提高各种任务的生成质量,而不需要进行微调。然而,LLM输出的多样性方面之前还没有系统地研究过。为了解决这个问题,我们提出了一个简单的方法,它扩展了LLM的生成,同时保留了其质量。在三个基准GCR数据集上的实验结果表明,我们的方法实现了质量与多样性的理想平衡。此外,我们提出的方法生成的句子可以作为现有常识生成器的训练数据,以提高其多样性。
https://arxiv.org/abs/2404.16807
Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
近年来,大型预训练视觉语言模型在零散分布任务上的表现已经引人注目。在此基础上,一些研究,如CoOp和CoCoOp,提出了使用提示学习的方法,其中上下文在提示中替换为可学习向量,从而在手动设计的提示上取得了显著的改进。然而,对于未见过的类别的性能提升仍然很小,为了解决这个问题,传统零散学习技术中经常使用数据增强。通过我们的实验,我们发现了CoOp和CoCoOp中重要的问题:通过传统图像增强学习到的上下文存在偏见,不利于对未见过的类别的泛化。为了解决这个问题,我们提出了一个对抗性标记嵌入策略,当在提示中诱导偏见时,将低级视觉增强特征与高级分类信息分离。通过我们新颖的机制“在提示中添加属性”,AAPL,我们引导可学习上下文有效地提取未见过的类别的文本特征。我们在11个数据集上进行了实验,总体而言,AAPL在零散分布学习、少样本学习、跨数据集学习和领域泛化任务上的表现与现有方法相比具有优势。
https://arxiv.org/abs/2404.16804
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, thereby directly obtaining this stronger model by extrapolating from the weights of the former two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
尽管大型语言模型(LLMs)在理想情况下能够随着数据和计算能力的增加而扩展其能力,但它们在现实中受到有限资源的限制。假设我们手中有一个适度训练的LLM(例如,训练以与人类偏好对齐),我们能否进一步发掘其潜力并以较低的成本获得更强的模型?在本文中,我们提出了一个简单的方法叫做ExPO,用于提高LLMs与人类偏好的对齐程度。ExPO假设一个中庸对齐的模型可以平滑地存在于一个较不满意的(较弱)模型和更好对齐的(较强)模型之间,从而通过从这两个较弱模型的权重中进行扩展直接获得这个更强的模型。在AlpacaEval 2.0基准上,我们证明了ExPO将偏好数据较少的模型(例如10%或20%)推向并甚至超过完全训练的模型,而没有任何额外的训练。此外,ExPO还显著地改善了标准DPO/RLHF模型,并在模型规模从7B到70B时表现出良好的可扩展性。我们的工作表明,模型扩展在利用LLM的能力方面具有有效性,为未来的探索提供了一个有前景的方向。
https://arxiv.org/abs/2404.16792
Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating \textbf{text-rich visual comprehension} of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at this https URL.
理解丰富文本的视觉内容对于多模态大型语言模型的实际应用至关重要,因为这种场景在现实生活中随处可见,特点是图像中嵌入大量文本。近年来,具有令人印象深刻的多样性的MLLM的出现提高了我们对MLLM的期望,然而,对于这些MLLM在丰富文本场景中的表现,我们还没有进行全面的、客观的评估,因为目前的MLLM基准主要关注评估通用视觉理解。在本文中,我们介绍了SEED-Bench-2-Plus,一个专门为评估MLLM的丰富文本视觉理解而设计的基准。我们的基准包括2300多个多选题问题,带有精确的人类注释,涵盖了三个广泛的类别:图表、地图和网站,每个类别涵盖了现实世界中的广泛文本丰富场景。由于它们的固有复杂性和多样性,这些类别有效地模拟了现实世界的文本丰富环境。我们进一步对34个著名的MLLM(包括GPT-4V、Gemini-Pro-Vision和Claude-3-Opus)进行了深入评估,并强调了MLLM在丰富文本视觉理解方面的当前局限性。我们希望我们的工作能为现有的MLLM基准提供宝贵的补充,提供有关丰富文本视觉理解与MLLM的进一步研究,以及有益的观察。数据和评估代码可以在此链接访问:https://url.in/
https://arxiv.org/abs/2404.16790
The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at this https URL.
近年来,基于静态、预先收集的通用数据集训练的大语言模型(LLMs)的成功引发了大量的研究方向和应用。其中一种方向解决了将预训练的LLM集成到动态数据分布、任务结构和用户偏好中的非平凡挑战。经过专门调整以满足特定需求后,预训练的LLM在先前知识领域中的表现常常会显著下降,这种现象被称为“灾难性遗忘”。尽管在持续学习(CL)领域得到了广泛研究,但LLMs在LLM领域中呈现出了新的表现形式。在本次调查中,我们全面概述了LLM在CL背景下的当前研究进展。本次调查分为四个主要部分:我们首先描述了持续学习LLMs的概述,包括两个方向:垂直连续(或垂直持续学习),即从通用到特定能力的持续适应,以及水平连续(或水平持续学习),即跨越时间和领域的持续适应(第3节)。接着我们总结了在现代CL背景下学习LLM的三个阶段:持续预训练(CPT)、领域自适应预训练(DAP)和持续微调(CFT)(第4节)。然后我们概述了使用LLMs进行持续学习的评估协议以及当前可用的数据源(第5节)。最后,我们讨论了与LLM的持续学习相关的一些有趣问题(第6节)。本次调查中审查的论文清单可以在这个https:// URL中找到。
https://arxiv.org/abs/2404.16789
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative rewards via a direct policy parameterization between two completions to a prompt, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
最初是为连续控制问题而开发的,但Proximal Policy Optimization (PPO)现在已成为各种强化学习(RL)应用(包括对生成模型的微调)的摇钱树。不幸的是,PPO需要多个技巧来实现稳定的收敛(例如值网络,截断) ,并以其对这些组件的具体实现非常敏感而闻名。为了应对这个问题,我们回退一步并问:在生成模型时代,一个简约的RL算法会是什么样子?我们提出了REBEL,一种通过直接对两个完成之间的策略参数化来降低策略优化问题的算法。在理论方面,我们证明了诸如自然策略梯度等基本RL算法可以被视为REBEL的变体,这使我们能够匹配RL文献中关于收敛和样本复杂性的最强已知理论保证。REBEL还可以干净地整合离线数据,并处理我们经常遇到的实际问题中的自偏好。在实证研究中,我们发现REBEL在语言建模和图像生成方面的性能与PPO和DPO相当或更好,同时比PPO更简单地实现,并且具有更快的计算可处理性。
https://arxiv.org/abs/2404.16767
While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across eight languages demonstrate the efficacy of PreTTY in cross-lingual settings. Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can achieve performance comparable to their SFT counterparts. This method presents a cost-effective alternative to SFT and advances the democratization of multilingual LLMs.
虽然监督微调(SFT)已经是一种将基础大型语言模型(LLM)输出定制到特定偏好的直接方法,但人们对其深度的担忧也随之提出,有些批评认为这仅仅是“表面”。我们将在跨语言生成任务的范围内对这一假设进行深入探讨,并提出一个名为PreTTY的新训练-免费对齐方法,该方法采用最小化任务相关的前缀来连接基础LLM和SFT LLM,实现与训练相同或更好的性能,而无需训练。在八种语言的机器翻译和词性标注实验中,证明了PreTTY在跨语言环境中的有效性。值得注意的是,通过仅使用前几个前缀启动解码过程,基础LLM可以实现与SFT同级的性能。这种方法为SFT提供了一种成本效益高的替代方案,并推动了多语言LLM的民主化。
https://arxiv.org/abs/2404.16766
Developing generalist foundation model has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which emphasizes the requirements on developing open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models, to extend the original datasets (over 25,692 non-contrast 3D chest CT volume and reports from 20,000 patients) from the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665 K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of CT volume in the form of a segmentation mask; (iii) 1.3 M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models, by training to generate texts based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
在人工智能领域(AI4Medicine)的研究者中,开发通用基础模型最近引起了巨大的关注。这些模型的关键在于它们对数据集扩大的依赖,强调开发包含各种成像模式下不同监督信号的开放医疗图像数据集。在本文中,我们介绍了RadGenome-Chest CT,一个基于CT-RATE的全面、大规模、区域指导的3D chest CT解释数据集。具体来说,我们利用最先进的强大通用分割和大型语言模型,从以下方面扩展了原始数据集:(一)覆盖197个类别的器官级别分割掩码,为解释提供中间推理的视觉提示;(二)665K个多粒度 grounded 报告,其中每个报告的句子都与相应的 CT 体积的解剖区域通过分割掩码链接;(三)1.3M个 grounded VQA 对,其中问题及其答案都与参考分割掩码链接,使模型能够将视觉证据与文本解释相关联。所有验证集中的 grounded 报告和 VQA 对都经过手动验证,以确保数据集质量。我们相信,RadGenome-Chest CT 可以通过根据给定分割区域生成文本,从而显著推动多模态医疗基础模型的开发,这是之前相关数据集无法实现的。我们将释放所有分割掩码、 grounded 报告和 VQA 对,以促进该领域进一步的研究和发展。
https://arxiv.org/abs/2404.16754
Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a \it{single vector based on the class label}. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.
视觉语言模型使得无需重新训练即可对开放世界中的物体进行分类。虽然这种零样本范式取得了重大进展,但即使是最先进的模型在物体不与其典型描述相当时也会表现出偏斜的性能。现实世界中的苹果呈现出各种形式——从切成薄片到整个,放在桌子上或碗里——然而,标准视觉语言模型将类别的实例映射到基于类别的单个向量上。我们认为,为了在类中表示这种丰富的多样性,零样本分类应超越单一向量。我们提出了一种方法,通过推断属性来编码和解释类中的多样性,在零样本设置中不需要重新训练。我们发现在一系列包括层次结构、多样物体状态和真实世界地理多样性的大数据集上,我们的方法 consistently优于标准零样本分类。此外,我们的方法具有内在可解释性,为每个推理提供准确的解释,从而促进模型调试和提高透明度。我们还发现,我们的方法能够有效地扩展到大量的属性,以考虑多样性,从而使典型实例的预测更准确。最后,我们描述了总体和最差类准确度之间的原则性权衡,该权衡可以通过我们方法的超参数进行调整。我们希望这项工作能够推动关于零样本分类在捕捉世界多样性方面的前景以及在不牺牲性能的情况下构建透明AI系统的研究。
https://arxiv.org/abs/2404.16717
We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with remaining layers of the model. Our proposed self-speculative decoding approach has less memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes on different types of training: pretraining from scratch, continual pretraining, finetuning on specific data domain, and finetuning on specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on TOPv2 semantic parsing task.
我们提出了LayerSkip,这是一种加速大型语言模型(LLM)推理速度的端到端解决方案。首先,在训练过程中,我们应用层下落,对于较早的层,下落率较低,对于较晚的层,下落率较高,并且有一个早期的退出损失,其中所有Transformer层都共享相同的退出。其次,在推理过程中,我们证明了这种训练方法在较早的层上增加了早期退出模型的准确性,而没有添加任何辅助层或模块到模型中。第三,我们提出了一个新颖的自适应解码解决方案,其中我们在较早的层退出,并使用模型的剩余层来验证和纠正。我们针对LLama模型的大小和不同类型的训练进行了实验:从头预训练,继续预训练,针对特定数据领域的微调,以及针对特定任务的微调。我们实现了我们的推理解决方案,并在综述的CNN/DM文档上展示了速度提高至2.16倍,在编码上提高了1.82倍,在TOPv2语义解析任务上提高了2.0倍。
https://arxiv.org/abs/2404.16710
In the rapidly evolving field of artificial intelligence, ensuring safe decision-making of Large Language Models (LLMs) is a significant challenge. This paper introduces Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLMs agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in the ability of models to manage shared resources. Furthermore, we find that by removing the ability of agents to communicate, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.
在人工智能领域,确保大型语言模型(LLMs)的安全决策是一个重要的挑战。本文介绍了治理共享资源模拟(GovSim)模拟平台,该平台旨在研究LLMs的战略互动和合作决策。通过这个仿真环境,我们探讨了AI代理之间资源共享的动态,强调了道德考虑、战略规划和谈判技能的重要性。GovSim具有灵活性,支持任何基于文本的代理,包括LLM代理。利用生成代理框架,我们创建了一个标准代理,促进不同LLM的集成。我们的研究结果表明,在GovSim中,只有两个 out of 15 经测试的LLM成功地实现了可持续的结果,表明模型在管理共享资源方面的能力存在显著的差距。此外,我们发现,通过移除代理与进行沟通的能力,它们超出了共享资源的使用,强调了沟通在合作中的重要性。有趣的是,大多数LLM都缺乏普遍化假设的能力,这表明它们在推理能力方面存在显著的弱点。我们开源了我们所有研究的完整套件,包括仿真环境、代理提示和综合网页界面。
https://arxiv.org/abs/2404.16698
We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAl's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.
我们通过进行四项预注册实验,研究了人类和OpenAl的GPT-4大型语言模型在解决问题行为方面的差异,以探讨添加偏差(addition bias)这一认知趋势。实验涉及来自美国588名参与者和GPT-4模型的680个迭代。问题解决任务可以是创建网格内的对称性(实验1和3)或者编辑摘要(实验2和4)。 根据我们的假设,我们发现总体上存在添加偏差。解决方案效率(实验1和2)和指令的积极性(实验3和4)非常重要。当减法相对更有效时,人类参与者更不可能使用添加策略。GPT-4表现出相反的行为,在减法更有效时具有强烈的添加偏差。 在指令积极性方面,GPT-4在被告知“改进”时更可能添加单词,而人类则没有这种效果。当我们研究添加偏差在不同条件下时,发现GPT-4的回答更加偏见,相对于人类而言。我们的研究结果强调了考虑可比较的和有时更好的减法替代方案的重要性,以及重新评估自己以及特别是语言模型的解决问题的行为。
https://arxiv.org/abs/2404.16692
Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain. Our code is available at \url{this https URL}.
视觉指令微调是一种新的学习范式,涉及使用任务特定指令对预训练语言模型进行微调。在这个范式中,我们专注于提高模型在理解并遵循与情感上下文相关的指令方面的能力。首先,我们识别出对视觉情感识别至关重要的关键视觉线索。接着,我们引入了一种新颖的GPT辅助生成情感视觉指令数据的长式依赖关系网络,有效解决了该领域中标注指令数据不足的问题。通过在InstructionBLIP工作的基础上拓展工作,我们提出的EmoVIT架构利用大型语言模型的强大能力来增强性能。通过广泛的实验,我们的模型在情感分类、情感推理和理解幽默方面展现了卓越的表现。比较分析为LLM时代的情感视觉指令微调提供了一个稳健的基准,为这个领域提供了宝贵的见解,并开拓了未来的研究方向。我们的代码可在此处访问:\url{这个链接}。
https://arxiv.org/abs/2404.16670
Developing autonomous agents for mobile devices can significantly enhance user interactions by offering increased efficiency and accessibility. However, despite the growing interest in mobile device control agents, the absence of a commonly adopted benchmark makes it challenging to quantify scientific progress in this area. In this work, we introduce B-MoCA: a novel benchmark designed specifically for evaluating mobile device control agents. To create a realistic benchmark, we develop B-MoCA based on the Android operating system and define 60 common daily tasks. Importantly, we incorporate a randomization feature that changes various aspects of mobile devices, including user interface layouts and language settings, to assess generalization performance. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to enhance their effectiveness. Our source code is publicly available at this https URL.
开发移动设备上的自主代理可以显著增强用户交互,通过提供更高的效率和可访问性。然而,尽管移动设备控制代理越来越受到关注,但缺乏一个普遍适用的基准使得衡量这一领域科学进展具有挑战性。在这项工作中,我们介绍了B-MoCA:一个专门为评估移动设备控制代理而设计的新的基准。为了创建一个真实的基准,我们基于Android操作系统开发B-MoCA,并定义了60个常见的日常任务。重要的是,我们引入了随机化功能,随机改变移动设备的各个方面,包括用户界面布局和语言设置,以评估泛化性能。我们基准了各种代理,包括使用大型语言模型(LLMs)或多模态LLM训练的代理以及使用人类专家演示训练的代理。虽然这些代理在执行简单任务时表现出熟练,但他们在复杂任务上的表现却显露出未来研究可以改进其有效性的巨大潜力。我们的源代码可在此链接公开使用。
https://arxiv.org/abs/2404.16660
Recently, deep learning-based language models have significantly enhanced text-to-SQL tasks, with promising applications in retrieving patient records within the medical domain. One notable challenge in such applications is discerning unanswerable queries. Through fine-tuning model, we demonstrate the feasibility of converting medical record inquiries into SQL queries. Additionally, we introduce an entropy-based method to identify and filter out unanswerable results. We further enhance result quality by filtering low-confidence SQL through log probability-based distribution, while grammatical and schema errors are mitigated by executing queries on the actual database. We experimentally verified that our method can filter unanswerable questions, which can be widely utilized even when the parameters of the model are not accessible, and that it can be effectively utilized in practice.
近年来,基于深度学习的语言模型在文本到数据库任务上取得了显著的提高,在医疗领域有广泛的应用,如检索病历记录。在这种应用中,一个值得注意的是区分不可回答的问题。通过微调模型,我们证明了将医疗记录查询转换为SQL查询是可能的。此外,我们还引入了一种基于熵的方法来识别和过滤不可回答的结果。通过基于日志概率分布过滤低置信度的SQL,我们进一步提高了结果的质量。通过在实际数据库上执行查询来过滤低置信度的SQL,我们可以缓解语义和模式错误。我们通过实验验证,我们的方法可以过滤不可回答的问题,即使模型的参数不可用,实际应用中也可以广泛使用,而且在实践中取得了良好的效果。
https://arxiv.org/abs/2404.16659
Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.
大语言模型(LLMs)在语言理解和生成方面展示了深刻的应用潜力,推动了各种应用的发展。然而,在将LLMs扩展到具有500亿参数以上最小尝试和错误成本的范围内时,特别是在计算资源方面,依然存在显著的缺乏详细、开源的方法论。在本文中,我们介绍了Tele-FLM(即FLM-2),一个520亿参数开源的多语言大语言模型,具有稳定的预训练范式和增强的的事实判断能力。Tele-FLM展示了卓越的多语言语言建模能力,以BPB衡量文本语料库。此外,在英语和中文基础模型评估中,它与涉及更大预训练FLOPs的强开源模型(如Llama2-70B和DeepSeek-67B)相当。除了模型权重外,我们还分享了核心设计、工程实践和训练细节,我们期望这将有益于学术界和工业界。
https://arxiv.org/abs/2404.16645