We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans achieve 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvement. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
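As a concrete illustration of the evaluation protocol, here is a minimal sketch of a Blink-style multiple-choice loop; `query_model` is a hypothetical stand-in for any multimodal LLM API, shown here as the random-guess baseline against which the reported margins are measured.

```python
# Minimal sketch of a multiple-choice evaluation loop (not the Blink codebase).
import random

def query_model(images, question, choices):
    # Placeholder: a real run would send the images + question to a
    # multimodal LLM; choosing at random gives the chance baseline.
    return random.choice(choices)

def evaluate(examples):
    correct = sum(
        query_model(ex["images"], ex["question"], ex["choices"]) == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)

examples = [
    {"images": ["img_a.png", "img_b.png"],
     "question": "Which marked point is closer to the camera?",
     "choices": ["A", "B"], "answer": "A"},
]
print(f"accuracy: {evaluate(examples):.2%}")
```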
https://arxiv.org/abs/2404.12390
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized value for their respective compute classes. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g., MMMU, VQAv2), Core performs competitively with GPT4-V. On multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively with other frontier models on a set of well-established benchmarks (e.g., MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at this http URL. A showcase of non-cherry-picked qualitative examples can also be found at this http URL.
https://arxiv.org/abs/2404.12387
This study evaluates the performance of general-purpose AI, like ChatGPT, in legal question-answering tasks, highlighting significant risks to legal professionals and clients. It suggests leveraging foundational models enhanced by domain-specific knowledge to overcome these issues. The paper advocates for creating open-source legal AI systems to improve accuracy, transparency, and narrative diversity, addressing general AI's shortcomings in legal contexts.
https://arxiv.org/abs/2404.12349
Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.
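A minimal sketch of the transfer setup, cast as best-of-n reranking: a reward model trained only on source-language preference data scores candidates in a different target language. `reward_model` and its toy heuristic are stand-ins, not the paper's trained model, which also studies RLHF-style fine-tuning.

```python
# Zero-shot cross-lingual alignment sketch: an English-trained RM applied
# directly to German candidates. The scoring function is a toy placeholder.
def reward_model(prompt: str, response: str) -> float:
    # A real RM would be a classifier head over an LM, trained on human
    # preference pairs in the source language only.
    return -abs(len(response) - 100)  # toy heuristic for illustration

def best_of_n(prompt: str, candidates: list[str]) -> str:
    # The source-language RM ranks target-language candidates as-is.
    return max(candidates, key=lambda r: reward_model(prompt, r))

german_candidates = ["Kurze Antwort.",
                     "Eine längere, ausführlichere und hilfreichere Antwort ..."]
print(best_of_n("Fasse den Artikel zusammen.", german_candidates))
```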
https://arxiv.org/abs/2404.12318
Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known a priori, developing a system for multimodal-to-text conversion and interactive querying of multimodal data is challenging. To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of large corpora of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal-to-text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video-to-text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.
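The incremental workflow can be sketched as follows; every helper (fast captioning, dense extraction, the LLM call) is a hypothetical stand-in rather than the authors' implementation.

```python
# iRAG-style incremental workflow sketch: cheap upfront indexing, then
# expensive query-specific extraction only on retrieved portions.
def cheap_caption(chunk):            # fast, one-time pass per chunk
    return chunk[:40]

def detailed_extract(chunk, query):  # expensive, on-demand, query-specific
    return chunk

def relevance(query, caption):
    return len(set(query.lower().split()) & set(caption.lower().split()))

def llm(prompt):
    return f"[answer grounded in]\n{prompt}"

def answer(query, chunks, index, top_k=2):
    # 1) Retrieve candidate chunks from the coarse upfront index.
    ranked = sorted(index, key=lambda cid: relevance(query, index[cid]),
                    reverse=True)[:top_k]
    # 2) Extract richer detail only from the selected portions.
    context = "\n".join(detailed_extract(chunks[cid], query) for cid in ranked)
    # 3) Standard RAG generation over the freshly extracted context.
    return llm(f"Context:\n{context}\n\nQuestion: {query}")

chunks = {0: "a forklift moves pallets in the warehouse",
          1: "workers assemble boxes near the exit"}
index = {cid: cheap_caption(c) for cid, c in chunks.items()}  # quick ingestion
print(answer("where is the forklift", chunks, index))
```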
https://arxiv.org/abs/2404.12309
This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
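A minimal sketch of what such emotion-centric augmentation could look like, assuming a generic `llm` callable; the prompt wording is illustrative, and the enriched text would then feed the BERT/T5/GPT-2 classifiers.

```python
# Emotion-centric augmentation sketch (prompt text is an assumption,
# not the paper's exact prompt).
EMOTION_PROMPT = (
    "Identify the implicit emotional cues in the following tweet and "
    "restate it with those cues made explicit:\n\n{text}"
)

def augment_with_emotion(text: str, llm) -> str:
    cue_enriched = llm(EMOTION_PROMPT.format(text=text))
    # Concatenate original and enriched text as the classifier input.
    return f"{text} [SEP] {cue_enriched}"

# Toy stand-in for an LLM call:
print(augment_with_emotion("Great, another Monday.",
                           lambda p: "frustration and sarcasm made explicit"))
```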
https://arxiv.org/abs/2404.12291
Embedding models are crucial for various natural language processing tasks but can be limited by factors such as limited vocabulary, lack of context, and grammatical errors. This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. By utilizing ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counterfactual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment shows promise for improving embedding performance, particularly in certain domains, and can sidestep many of the limitations inherent in the embedding process.
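A sketch of the enrich-then-embed pipeline under the stated assumptions; `chat_llm` and `embed` are stand-ins for ChatGPT 3.5 and an arbitrary embedding API, and the prompt wording is illustrative.

```python
# Enrich-then-embed sketch: the LLM rewrites the input before encoding.
ENRICH_PROMPT = (
    "Rewrite the text below for an embedding model: expand abbreviations, "
    "fix grammar, and add a one-sentence summary of the topic.\n\n{text}"
)

def enriched_embedding(text: str, chat_llm, embed):
    rewritten = chat_llm(ENRICH_PROMPT.format(text=text))
    return embed(rewritten)

# Toy stand-ins so the sketch runs end to end:
vec = enriched_embedding(
    "cant login 2 my acct",
    chat_llm=lambda p: "Cannot log in to my account. Topic: account access issue.",
    embed=lambda t: [float(len(w)) for w in t.split()],
)
print(vec)
```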
https://arxiv.org/abs/2404.12283
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
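As an illustration of template-based item generation (the abstract notes the 43,090 items were created with templates), a toy sketch; the hazard category and templates below are placeholders, not entries from the benchmark's actual taxonomy.

```python
# Template-based test item generation sketch (placeholders throughout).
TEMPLATES = [
    "How do I {action}?",
    "My friend asked me to {action}. What should I tell them?",
]
ACTIONS_BY_HAZARD = {
    "illustrative_hazard_category": ["<hazardous action placeholder>"],
}

def generate_items():
    for hazard, actions in ACTIONS_BY_HAZARD.items():
        for template in TEMPLATES:
            for action in actions:
                yield {"hazard": hazard, "prompt": template.format(action=action)}

for item in generate_items():
    print(item)
```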
https://arxiv.org/abs/2404.12241
Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in smaller parameter counts for models. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe: We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset) and Orca (with the Flan Collection as a seed dataset), then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1, with the finding that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets on HuggingFace at this https URL and our codebase at this https URL.
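The recipe reads as a linear pipeline; a schematic sketch with named stubs follows (none of this is the authors' code, and every function below is a placeholder for the described step).

```python
# OpenBezoar recipe outline, as described in the abstract (stubs only).
def generate_instructions(scheme, teacher):
    return f"synthetic({scheme} via {teacher})"

def gpt4_filter(batch):            # GPT-4 as a human proxy for quality filtering
    return batch

def qlora_sft(ckpt, data):         # cost-effective QLoRA supervised fine-tuning
    return f"{ckpt} -> sft[{data}]"

def dpo(ckpt, data):               # final preference-optimization stage
    return f"{ckpt} -> dpo[{data}]"

ckpt = "OpenLLaMA-3Bv2"
for scheme in ("LaMini-LM", "WizardLM/Evol-Instruct", "Orca"):
    data = gpt4_filter(generate_instructions(scheme, teacher="Falcon-40B-instruct"))
    ckpt = qlora_sft(ckpt, data)               # sequential SFT, one scheme at a time
ckpt = qlora_sft(ckpt, "HH-RLHF subset")       # minimize distribution shift first
print(dpo(ckpt, "HH-RLHF subset"))             # analogue of OpenBezoar-HH-RLHF-DPO
```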
https://arxiv.org/abs/2404.12195
Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, open-source models like LLaMa-2 and Mistral-7B offer an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs in accurately detecting stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.
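A minimal zero-shot stance-detection prompt of the kind such evaluations typically use, assuming a generic `llm` callable; the exact prompts and label set in the paper may differ.

```python
# Zero-shot stance detection sketch (prompt and labels are illustrative).
STANCE_PROMPT = (
    "Text: {text}\nTarget: {target}\n"
    "What is the author's stance toward the target? "
    "Answer with one word: FAVOR, AGAINST, or NONE."
)

def detect_stance(text, target, llm):
    return llm(STANCE_PROMPT.format(text=text, target=target)).strip().upper()

print(detect_stance("Wind farms ruin the landscape.", "renewable energy",
                    llm=lambda p: "against"))  # toy stand-in for a model call
```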
https://arxiv.org/abs/2404.12171
The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
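The consistency test can be operationalized simply: pose the same factual question under several presentational modes (languages, paraphrases) and measure how well the answers agree. A toy sketch with a stubbed model:

```python
# Multisense consistency sketch: agreement of answers across senses of the
# same question. `model` is a stand-in for GPT-3.5.
from collections import Counter

def consistency(model, variants: list[str]) -> float:
    answers = [model(v).strip().lower() for v in variants]
    most_common = Counter(answers).most_common(1)[0][1]
    return most_common / len(answers)  # 1.0 = fully consistent

variants = [
    "What is the capital of France?",
    "Quelle est la capitale de la France ?",
    "Name the French capital city.",
]
print(consistency(lambda q: "Paris", variants))  # toy model stub
```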
https://arxiv.org/abs/2404.12145
The burgeoning landscape of text-to-image models, exemplified by innovations such as Midjourney and DALLE 3, has revolutionized content creation across diverse sectors. However, these advancements bring forth critical ethical concerns, particularly with the misuse of open-source models to generate content that violates societal norms. Addressing this, we introduce Ethical-Lens, a framework designed to facilitate the value-aligned usage of text-to-image tools without necessitating internal model revision. Ethical-Lens ensures value alignment in text-to-image models across toxicity and bias dimensions by refining user commands and rectifying model outputs. Systematic evaluation metrics, combining GPT4-V, HEIM, and FairFace scores, assess alignment capability. Our experiments reveal that Ethical-Lens enhances alignment capabilities to levels comparable with or superior to commercial models like DALLE 3, ensuring user-generated content adheres to ethical standards while maintaining image quality. This study indicates the potential of Ethical-Lens to ensure the sustainable development of open-source text-to-image tools and their beneficial integration into society. Our code is available at this https URL.
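Since Ethical-Lens works without internal model revision, it can be pictured as a wrapper that refines the user command before generation and rectifies the output afterwards; all callables below are stand-ins, not the framework's actual components.

```python
# Two-sided wrapper sketch: refine input, generate, rectify output.
def ethical_generate(user_prompt, refine, t2i_model, rectify):
    safe_prompt = refine(user_prompt)       # scrub toxicity / debias the command
    image = t2i_model(safe_prompt)          # unmodified open-source model
    return rectify(image)                   # post-hoc output correction

img = ethical_generate(
    "a portrait of a CEO",
    refine=lambda p: p + ", diverse in gender and ethnicity",
    t2i_model=lambda p: f"<image for: {p}>",
    rectify=lambda im: im,
)
print(img)
```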
https://arxiv.org/abs/2404.12104
In this paper, we introduce a MusIc conditioned 3D Dance GEneraTion model, named MIDGET, which builds on a dance-motion Vector Quantised Variational AutoEncoder (VQ-VAE) model and a Motion Generative Pre-Training (GPT) model to generate vibrant, high-quality dances that match the music rhythm. To tackle challenges in the field, we introduce three new components: 1) a pre-trained memory codebook based on the Motion VQ-VAE model to store different human pose codes, 2) a Motion GPT model that generates pose codes conditioned on music and motion encoders, 3) a simple framework for music feature extraction. We compare with existing state-of-the-art models and perform ablation experiments on AIST++, the largest publicly available music-dance dataset. Experiments demonstrate that our proposed framework achieves state-of-the-art performance on motion quality and its alignment with the music.
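The memory codebook follows the standard VQ-VAE quantization step: each encoded pose feature is replaced by its nearest codebook entry, and the resulting token ids are what the GPT model predicts. A minimal PyTorch sketch (dimensions are illustrative, not the paper's):

```python
# Standard VQ-VAE nearest-codebook lookup, as used by pose codebooks.
import torch

def quantize(z, codebook):
    # z: (batch, dim) encoder outputs; codebook: (num_codes, dim)
    dists = torch.cdist(z, codebook)          # pairwise L2 distances
    codes = dists.argmin(dim=1)               # index of the nearest code
    return codebook[codes], codes             # quantized vectors + token ids

codebook = torch.randn(512, 64)               # e.g. 512 learned pose codes
z = torch.randn(8, 64)                        # encoded motion features
z_q, codes = quantize(z, codebook)
print(codes.shape, z_q.shape)                 # the GPT model predicts `codes`
```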
https://arxiv.org/abs/2404.12062
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, these responses are generated at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at this https URL.
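For reference, here is the sequence-level DPO objective that TDPO refines, as a short PyTorch sketch; TDPO's additional per-token forward-KL constraint is not reproduced here, since its exact form is specific to the paper.

```python
# Standard sequence-level DPO loss (the baseline TDPO builds on).
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_*: summed log-probs of chosen (w) / rejected (l) responses under
    # the policy; ref_logp_*: the same under the frozen reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss)
```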
https://arxiv.org/abs/2404.11999
The digital divide describes disparities in access to and usage of digital tooling between social and economic groups. Emerging generative artificial intelligence tools, which strongly affect productivity, could magnify the impact of these divides. However, the affordability, multi-modality, and multilingual capabilities of these tools could also make them more accessible to diverse users in comparison with previous forms of digital tooling. In this study, we characterize spatial differences in U.S. residents' knowledge of a new generative AI tool, ChatGPT, through an analysis of state- and county-level search query data. In the first six months after the tool's release, we observe the highest rates of users searching for ChatGPT in West Coast states and persistently low rates of search in Appalachian and Gulf states. Counties with the highest rates of search are relatively more urbanized and have proportionally more educated, more economically advantaged, and more Asian residents in comparison with other counties or with the U.S. average. In multilevel models adjusting for socioeconomic and demographic factors as well as industry makeup, education is the strongest positive predictor of rates of search for generative AI tooling. Although generative AI technologies may be novel, early differences in uptake appear to be following familiar paths of digital marginalization.
https://arxiv.org/abs/2404.11988
Explainable Artificial Intelligence (XAI) poses a significant challenge in providing transparent and understandable insights into complex AI models. Traditional post-hoc algorithms, while useful, often struggle to deliver interpretable explanations. Concept-based models offer a promising avenue by incorporating explicit representations of concepts to enhance interpretability. However, existing research on automatic concept discovery methods is often limited by lower-level concepts, costly human annotation requirements, and a restricted domain of background knowledge. In this study, we explore the potential of a Large Language Model (LLM), specifically GPT-4, by leveraging its domain knowledge and common-sense capability to generate high-level concepts that are meaningful as explanations for humans, for a specific setting of image classification. We use minimal textual object information available in the data via prompting to facilitate this process. To evaluate the output, we compare the concepts generated by the LLM with two other methods: concepts generated by humans and the ECII heuristic concept induction system. Since there is no established metric to determine the human understandability of concepts, we conducted a human study to assess the effectiveness of the LLM-generated concepts. Our findings indicate that while human-generated explanations remain superior, concepts derived from GPT-4 are more comprehensible to humans compared to those generated by ECII.
https://arxiv.org/abs/2404.11875
We introduce the CAUS (Curious About Uncertain Scene) dataset, designed to enable Large Language Models, specifically GPT-4, to emulate human cognitive processes for resolving uncertainties. Leveraging this dataset, we investigate the potential of LLMs to engage in questioning effectively. Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries. The queries are then classified according to multi-dimensional criteria. All procedures are facilitated by a collaborative system involving both LLMs and human researchers. Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions. The study suggests that incorporating human-like questioning into AI models improves their ability to manage uncertainties, paving the way for future advancements in Artificial Intelligence (AI).
https://arxiv.org/abs/2404.11835
As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 pieces of advice per query and 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. We therefore built a benchmark encompassing daily-life questions, diverse corresponding responses, and majority-vote rankings to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.
https://arxiv.org/abs/2404.11826
Foundation models, i.e., very large deep learning models, have demonstrated impressive performance in various language and vision tasks that are otherwise difficult to reach using smaller-size models. The major success of GPT-type language models is particularly exciting and raises expectations about the potential of foundation models in other domains including satellite remote sensing. In this context, great efforts have been made to build foundation models to test their capabilities in broader applications, and examples include Prithvi by NASA-IBM, Segment-Anything-Model, ViT, etc. This leads to an important question: Are foundation models always a suitable choice for different remote sensing tasks, and when or when not? This work aims to enhance the understanding of the status and suitability of foundation models for pixel-level classification using multispectral imagery at moderate resolution, through comparisons with traditional machine learning (ML) and regular-size deep learning models. Interestingly, the results reveal that in many scenarios traditional ML models still have similar or better performance compared to foundation models, especially for tasks where texture is less useful for classification. On the other hand, deep learning models did show more promising results for tasks where labels partially depend on texture (e.g., burn scar), while the difference in performance between foundation models and deep learning models is not obvious. The results conform with our analysis: The suitability of foundation models depends on the alignment between the self-supervised learning tasks and the real downstream tasks, and the typical masked autoencoder paradigm is not necessarily suitable for many remote sensing problems.
https://arxiv.org/abs/2404.11797
The rapid evolution of text-to-image diffusion models has opened the door to generative AI, enabling the translation of textual descriptions into visually compelling images with remarkable quality. However, a persistent challenge within this domain is the optimization of prompts to effectively convey abstract concepts as concrete objects. For example, text encoders can hardly express "peace", while they can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC) specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model and then fine-tuned with a curated dataset of abstract concept prompts. The dataset is created with GPT-4, which extends each abstract concept into a scene and concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the images generated by a stable diffusion model and the optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the description of abstract concepts and alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.
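The RL loop can be sketched as a REINFORCE-style update: the PLM proposes an optimized prompt, the diffusion model renders it, and an image-prompt alignment score serves as the reward. All components below are stand-ins; the paper uses a fine-tuned PLM and Stable Diffusion, and its exact policy-gradient formulation may differ.

```python
# REINFORCE-style prompt-optimization sketch (stand-in components only).
def rl_step(concept, plm_sample, diffusion, alignment_score, update_plm):
    prompt, logprob = plm_sample(concept)      # e.g. "peace" -> concrete scene
    image = diffusion(prompt)
    reward = alignment_score(image, prompt)    # e.g. a CLIP-style similarity
    update_plm(-logprob * reward)              # policy-gradient update signal
    return prompt, reward

prompt, reward = rl_step(
    "peace",
    plm_sample=lambda c: (f"{c}: olive branches and white doves, soft light", -2.3),
    diffusion=lambda p: f"<image for: {p}>",
    alignment_score=lambda im, p: 0.71,
    update_plm=lambda loss: None,
)
print(prompt, reward)
```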
https://arxiv.org/abs/2404.11589