Recent advancements in instruction-following models have made interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We develop three layout reasoning tasks to train the model to understand and execute layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses the performance of few-shot GPT-4V models, with a 12% higher mIoU on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually rich documents.
https://arxiv.org/abs/2404.15271
We study interactive learning of language agents based on user edits made to the agent's output. In a typical setting such as writing assistants, the user interacts with a language agent to generate a response given a context, and may optionally edit the agent's response to personalize it based on their latent preference, in addition to improving its correctness. The edit feedback is naturally generated, making it a suitable candidate for improving the agent's alignment with the user's preference, and for reducing the cost of user edits over time. We propose a learning framework, PRELUDE, that infers a description of the user's latent preference from historic edit data and uses it to define a prompt policy that drives future response generation. This avoids fine-tuning the agent, which is costly, challenging to scale with the number of users, and may even degrade performance on other tasks. Furthermore, learning a descriptive preference improves interpretability, allowing the user to view and modify the learned preference. However, user preference can be complex and vary with context, making it challenging to learn. To address this, we propose a simple yet effective algorithm named CIPHER that leverages a large language model (LLM) to infer the user preference for a given context from user edits. For a new context, CIPHER retrieves inferred preferences from the k closest contexts in the history and forms an aggregate preference for response generation. We introduce two interactive environments -- summarization and email writing -- for evaluation using a GPT-4 simulated user. We compare with algorithms that directly retrieve user edits but do not learn a descriptive preference, and algorithms that learn a context-agnostic preference. On both tasks, CIPHER achieves the lowest edit distance cost and learns preferences that show significant similarity to the ground-truth preferences.
https://arxiv.org/abs/2404.15269
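The retrieval-and-aggregation step that CIPHER performs for a new context can be sketched as follows. This is a toy stand-in, not the paper's implementation: the bag-of-words embedding, the cosine ranking, and the string-join aggregation are placeholders for the LLM-based components, and all preference strings are invented for illustration.

```python
# Hedged sketch of CIPHER's retrieval step: given a new context, find the
# k closest historical contexts and aggregate their inferred preferences.
from collections import Counter
import math

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_preferences(history, context, k=2):
    """history: list of (context, inferred_preference) pairs."""
    q = embed(context)
    ranked = sorted(history, key=lambda h: cosine(q, embed(h[0])), reverse=True)
    return [pref for _, pref in ranked[:k]]

history = [
    ("email to my manager about the deadline", "formal, concise"),
    ("summary of a physics paper", "use bullet points"),
    ("email to a manager requesting leave", "formal, polite"),
]
prefs = retrieve_preferences(history, "email to my manager about vacation", k=2)
aggregate = "; ".join(prefs)  # aggregated preference to prepend to the prompt
```

In the paper the aggregate preference then conditions the response-generating prompt; here it is just a concatenated string.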
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g., a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize that unlabelled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised. Using only (1) a well-defined API schema and (2) a set of unlabelled dialogues between a user and an agent, we develop a novel approach for inferring turn-level annotations as latent variables using a noisy channel model. We iteratively improve these pseudo-labels with expectation-maximization (EM) and use the inferred labels to train an end-to-end dialogue agent. Evaluated on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
https://arxiv.org/abs/2404.15219
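The pseudo-labelling loop above can be illustrated with a hard-EM toy: starting from schema-derived seed keywords, alternately (E) assign each unlabelled utterance its most likely intent and (M) re-estimate keyword-intent associations from those pseudo-labels. The intents, utterances, and word-overlap scoring are invented stand-ins for the paper's noisy-channel model over full turn-level annotations.

```python
# Hedged hard-EM sketch: schema seeds -> pseudo-labels -> re-estimated model.
from collections import defaultdict

schema_seeds = {"book_hotel": {"hotel"}, "book_train": {"train"}}
utterances = [
    "i need a hotel in the centre",
    "find me a cheap hotel please",
    "i want a train to cambridge",
    "a train leaving after 9am please",
    "something cheap in the centre",   # no seed keyword; resolved by EM
]

def e_step(utts, scores):
    """Assign each utterance the intent with the largest word overlap."""
    labels = []
    for u in utts:
        words = set(u.split())
        best = max(scores, key=lambda c: len(words & scores[c]))
        labels.append(best)
    return labels

def m_step(utts, labels):
    """Re-estimate each intent's word set from the pseudo-labels."""
    scores = defaultdict(set)
    for u, lab in zip(utts, labels):
        scores[lab] |= set(u.split())
    return scores

scores = {k: set(v) for k, v in schema_seeds.items()}
for _ in range(3):                       # a few EM iterations
    labels = e_step(utterances, scores)
    scores = m_step(utterances, labels)
```

After one iteration the last utterance, which shares no seed keyword, is pulled toward the hotel intent via words ("cheap", "centre") learned from its pseudo-labelled neighbours.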
This paper presents a novel exploration into the regressive side effects of training Large Language Models (LLMs) to mimic student misconceptions for personalized education. We highlight the problem that as LLMs are trained to more accurately mimic student misconceptions, the factual integrity and reasoning ability of the models are compromised. Our work involved training an LLM on a student-tutor dialogue dataset to predict student responses. The results demonstrated a decrease in the model's performance across multiple benchmark datasets, including the ARC reasoning challenge and TruthfulQA, which evaluates the truthfulness of the model's generated responses. Furthermore, the HaluEval Dial dataset, used for hallucination detection, and MemoTrap, a memory-based task dataset, also showed a decline in model accuracy. To combat these side effects, we introduce a "hallucination token" technique. This token, appended at the beginning of each student response during training, instructs the model to switch between mimicking student misconceptions and providing factually accurate responses. Despite significant improvement across all datasets, the technique does not completely restore the LLM's baseline performance, indicating the need for further research in this area. This paper contributes to the ongoing discussion on the use of LLMs for student modeling, emphasizing the need for a balance between personalized education and factual accuracy.
https://arxiv.org/abs/2404.15156
Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that, rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.
https://arxiv.org/abs/2404.14901
With the ever-increasing scale of (causal) large language models (LLMs), inference efficiency has become a core concern alongside improved performance. Compared to the memory footprint, the latency bottleneck seems to be of greater importance, as there can be billions of requests to an LLM (e.g., GPT-4) per day. The bottleneck is mainly due to the autoregressive nature of LLMs, where tokens can only be generated sequentially during decoding. To alleviate it, the idea of speculative execution, which originates from the field of computer architecture, has been introduced to LLM decoding in a "draft-then-verify" style. Under this regime, a sequence of tokens is drafted at a fast pace using some heuristics, and the tokens are then verified in parallel by the LLM. As the costly sequential inference is parallelized, LLM decoding speed can be significantly boosted. Driven by the success of LLMs over the past couple of years, a growing literature in this direction has emerged. Yet there is no position survey that summarizes the current landscape and draws a roadmap for the future development of this promising area. To meet this demand, we present the first survey paper that reviews and unifies the literature on speculative execution in LLMs (e.g., blockwise parallel decoding, speculative decoding, etc.) in a comprehensive framework and a systematic taxonomy. Based on the taxonomy, we present a critical review and comparative analysis of the current art. Finally, we highlight key challenges and future directions to further develop the area.
https://arxiv.org/abs/2404.14897
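The draft-then-verify loop described above can be sketched with toy next-token functions standing in for the draft and target models. The transition tables are invented; a real system would run a small LM for drafting, and the large LM would score all drafted positions in a single parallel forward pass rather than the per-position calls shown here.

```python
# Hedged sketch of greedy draft-then-verify (speculative) decoding.
DRAFT = {"the": "cat", "cat": "sat", "sat": "on", "on": "a"}
TARGET = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def draft(ctx):
    return DRAFT.get(ctx[-1], "<eos>")

def target(ctx):
    return TARGET.get(ctx[-1], "<eos>")

def speculative_step(prefix, draft, target, k=4):
    """Draft k tokens, then verify them against the target model.

    Accept drafted tokens while the target agrees; at the first mismatch,
    keep the target's token instead and stop (the greedy acceptance rule).
    """
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        t = target(ctx)          # one parallel pass in a real implementation
        if t == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(t)   # first mismatch: take the target's token
            break
    return accepted

out = speculative_step(["the"], draft, target, k=4)
```

Here three drafted tokens are accepted and the fourth is corrected, so one verification round yields four tokens instead of one.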
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences shrink with model size. This work investigates the critical role of model scaling, asking whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvement in accuracy upon repeated presentation of a prompt. Results of the best-performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that increased model size may lead to better performance, but LLMs are still not sensitive to (un)grammaticality the way humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
https://arxiv.org/abs/2404.14883
A well-executed graphic design typically achieves harmony on two levels, from the fine-grained design elements (color, font, and layout) to the overall design. This complexity makes the comprehension of graphic design challenging, for it requires the capability both to recognize the design elements and to understand the design. With the rapid development of Multimodal Large Language Models (MLLMs), we establish DesignProbe, a benchmark to investigate the capability of MLLMs in design. Our benchmark includes eight tasks in total, across both the fine-grained element level and the overall design level. At the design element level, we consider both attribute recognition and semantic understanding tasks. At the overall design level, we include style and metaphor. We test nine MLLMs, with GPT-4 as the evaluator. Further experiments indicate that refining prompts can enhance the performance of MLLMs. We first have different LLMs rewrite the prompts and find that performance increases when a model refines its own prompts. We then add extra task knowledge in two different ways (text descriptions and image examples), finding that adding image examples boosts performance far more than text descriptions.
https://arxiv.org/abs/2404.14801
Mainstream poisoning attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries. However, a fixed trigger (e.g., unusual words) may be easily spotted by human inspection, limiting effectiveness and practicality in real-world scenarios. To enhance the stealthiness of the trigger, we present a poisoning attack against LLMs that is triggered by an output condition: a token limitation, a strategy users commonly adopt to reduce costs. The poisoned model behaves normally when output is not token-limited, but becomes harmful when it is. To achieve this objective, we introduce BrieFool, an efficient attack framework. It leverages the characteristics of generation limitation through efficient instruction sampling and poisoning data generation, thereby influencing the behavior of LLMs under the target conditions. Our experiments demonstrate that BrieFool is effective across safety domains and knowledge domains. For instance, with only 20 generated poisoning examples against GPT-3.5-turbo, BrieFool achieves a 100% Attack Success Rate (ASR) and a 9.28/10 average Harmfulness Score (HS) under token-limitation conditions while maintaining benign performance otherwise.
https://arxiv.org/abs/2404.14795
Large Language Models (LLMs) and multi-agent systems have shown impressive capabilities in natural language tasks but face challenges in clinical trial applications, primarily due to limited access to external knowledge. Recognizing the potential of advanced clinical trial tools that aggregate and predict based on the latest medical data, we propose an integrated solution to enhance their accessibility and utility. We introduce the Clinical Agent System (CT-Agent), a clinical multi-agent system designed for clinical trial tasks, leveraging GPT-4, multi-agent architectures, LEAST-TO-MOST, and ReAct reasoning techniques. This integration not only boosts LLM performance in clinical contexts but also introduces novel functionalities. Our system autonomously manages the entire clinical trial process, demonstrating significant efficiency improvements in our evaluations, which include both computational benchmarks and expert feedback.
https://arxiv.org/abs/2404.14777
This paper explores SynTOD, a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) Systems capable of handling complex tasks such as intent classification, slot filling, conversational question-answering, and retrieval-augmented response generation, without relying on crowdsourcing or real-world data. SynTOD utilizes a state transition graph to define the desired behavior of a TOD system and generates diverse, structured conversations through random walks and response simulation using large language models (LLMs). In our experiments, using graph-guided response simulations leads to significant improvements in intent classification, slot filling and response relevance compared to naive single-prompt simulated conversations. We also investigate the end-to-end TOD effectiveness of different base and instruction-tuned LLMs, with and without the constructed synthetic conversations. Finally, we explore how various LLMs can evaluate responses in a TOD system and how well they are correlated with human judgments. Our findings pave the way toward quick development and evaluation of domain-specific TOD systems. We release our datasets, models, and code for research purposes.
https://arxiv.org/abs/2404.14772
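The graph-guided generation idea above can be sketched as a random walk over a state-transition graph, with each visited state rendered into an utterance. The graph, the states, and the template renderer are illustrative inventions; SynTOD renders states with an LLM rather than fixed templates.

```python
# Hedged sketch of SynTOD-style synthesis: random walk -> dialogue skeleton.
import random

GRAPH = {
    "start": ["ask_intent"],
    "ask_intent": ["fill_slots"],
    "fill_slots": ["fill_slots", "retrieve"],   # may loop to collect slots
    "retrieve": ["answer"],
    "answer": ["end"],
}
TEMPLATES = {
    "ask_intent": "Agent: How can I help you?",
    "fill_slots": "Agent: Could you give me one more detail?",
    "retrieve": "Agent: Let me look that up.",
    "answer": "Agent: Here is what I found.",
}

def sample_dialogue(seed=0):
    """Walk the graph from 'start' to 'end', rendering each state."""
    rng = random.Random(seed)
    state, path = "start", []
    while state != "end":
        state = rng.choice(GRAPH[state])
        if state != "end":
            path.append(state)
    return [TEMPLATES[s] for s in path]

dialogue = sample_dialogue(seed=0)
```

Different seeds yield structurally valid but varied dialogues, which is the property the paper exploits to generate diverse training conversations.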
Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.
https://arxiv.org/abs/2404.14723
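For reference, the objective that the DPO variants above modify can be written for a single preference pair. The sketch below uses scalar sequence log-probabilities and illustrative numbers; real implementations sum token log-probabilities in a framework such as PyTorch.

```python
# Minimal sketch of the DPO loss for one preference pair, where y_w is the
# preferred and y_l the rejected response. pi_* are log-probabilities under
# the trained policy, ref_* under the frozen reference model.
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference the loss is log 2; it falls as the
# policy raises the preferred response's margin over the rejected one.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # log 2 ~ 0.693
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The beta hyperparameter controls how strongly the implicit reward is anchored to the reference model; the variants surveyed in the abstract largely alter this margin term or the role of the reference.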
Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in the dialogue history, without any model parameter update. Despite this convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes in-context example selection a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on the test input. Following the assumption that an accurate inverse inference probability (likelihood) will result in an accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-task and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method across various models, tasks, and modalities.
https://arxiv.org/abs/2404.14716
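The selection rule above can be sketched as scoring each candidate example by an inverse-inference likelihood and keeping the top-k. The Jaccard word-overlap used here is a toy proxy for a model's actual probability of the candidate's label conditioned on the test input; the pool of examples is invented for illustration.

```python
# Hedged sketch of ByCS-style selection: rank candidates by a proxy for
# the inverse inference likelihood, keep the top-k as in-context examples.
def inverse_likelihood(example, test_input):
    """Toy likelihood: Jaccard similarity between inputs."""
    ex_input, _ = example
    a, b = set(ex_input.split()), set(test_input.split())
    return len(a & b) / len(a | b)

def select_examples(candidates, test_input, k=2):
    ranked = sorted(candidates,
                    key=lambda ex: inverse_likelihood(ex, test_input),
                    reverse=True)
    return ranked[:k]

pool = [
    ("book a table for two", "restaurant"),
    ("play some jazz music", "music"),
    ("book a flight to rome", "travel"),
]
chosen = select_examples(pool, "book a table for dinner", k=2)
```

In the paper the likelihood comes from running the model "in reverse" on each candidate, so better-aligned examples receive higher scores than this lexical proxy can capture.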
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform complex reasoning for VLMs, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
https://arxiv.org/abs/2404.14715
Accurate and efficient language translation is an extremely important information processing task, and fast, accurate, machine-learning-enabled automated translation is a major topic of interest in the machine learning and data science communities. In this study, we examine using local Generative Pretrained Transformer (GPT) models to perform automated zero-shot, black-box, sentence-wise translation of multiple natural languages into English text. We benchmark 16 different open-source GPT models from the Huggingface LLM repository, with no custom fine-tuning, for translating 50 different non-English languages into English, using translated TED Talk transcripts as the reference dataset. These GPT model inference calls are performed strictly locally, on single A100 Nvidia GPUs. The reported benchmark metrics are language translation accuracy, using the BLEU, GLEU, METEOR, and chrF text overlap measures, and wall-clock time for each sentence translation. The best overall performing GPT model for translation into English is ReMM-v2-L2-13B on three of the four metrics, with mean scores across all tested languages of $0.152$ (BLEU), $0.256$ (GLEU), and $0.438$ (METEOR), while Llama2-chat-AYT-13B performs best on chrF with a mean score of $0.448$.
https://arxiv.org/abs/2404.14680
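To make the reported overlap scores concrete, here is a minimal BLEU-1 (modified unigram precision with a brevity penalty). This is a simplification for illustration only; benchmarks like the one above use full BLEU-4 from an established library such as sacrebleu.

```python
# Hedged sketch of a BLEU-style overlap metric: clipped unigram precision
# times a brevity penalty that punishes candidates shorter than the reference.
from collections import Counter
import math

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cc, rc = Counter(cand), Counter(ref)
    clipped = sum(min(cc[w], rc[w]) for w in cc)   # clip repeated words
    precision = clipped / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

The clipping step is why a degenerate candidate like "the the the" scores only 1/3 against "the cat sat" even though every token appears in the reference.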
Recent deep learning models such as ChatGPT, trained with the back-propagation algorithm, have exhibited remarkable performance. However, the disparity between biological brain processes and the back-propagation algorithm has been noted. The Forward-Forward algorithm, which trains deep learning models solely through forward passes, has emerged to address this. Although the Forward-Forward algorithm cannot replace back-propagation, due to limitations such as requiring special input and loss functions, it has the potential to be useful in situations where back-propagation is difficult to use. To work around this limitation and verify its usability, we propose an Unsupervised Forward-Forward algorithm. Using an unsupervised learning model enables training with ordinary loss functions and inputs, without restriction. This approach leads to stable learning and enables versatile use across various datasets and tasks. From a usability perspective, given the characteristics of the Forward-Forward algorithm and the advantages of the proposed method, we anticipate practical application even in scenarios such as federated learning, where deep learning layers need to be trained separately in physically distributed environments.
https://arxiv.org/abs/2404.14664
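The layer-local objective at the heart of the Forward-Forward algorithm can be sketched in a few lines: each layer's "goodness" is the sum of squared activations, and the layer is trained so that goodness exceeds a threshold for positive data and falls below it for negative data. The threshold value and activation vectors below are illustrative, not tuned settings.

```python
# Hedged sketch of the Forward-Forward layer objective (per Hinton's FF
# formulation): goodness = sum of squared activations, and the layer's
# probability that an input is "positive" is sigmoid(goodness - theta).
import math

def goodness(activations):
    return sum(a * a for a in activations)

def p_positive(activations, theta=2.0):
    """Probability the layer assigns to 'this input is positive data'."""
    g = goodness(activations)
    return 1.0 / (1.0 + math.exp(-(g - theta)))
```

Because each layer optimizes this local quantity from its own forward pass, no gradient needs to flow backwards through the network, which is what makes the algorithm attractive for the physically distributed settings mentioned above.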
Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and Gemini-Pro. Although fine-tuning with intermediate steps (i.e., rationales) elicits some mathematical reasoning skills, the resulting models still fall short in visual comprehension due to inadequate visual-centric supervision, which leads to inaccurate interpretation of math figures. To address this issue, we propose a two-step training pipeline VCAR, which emphasizes the Visual Comprehension training in Addition to mathematical Reasoning learning. It first improves the visual comprehension ability of MLLMs through the visual description generation task, followed by another training step on generating rationales with the assistance of descriptions. Experimental results on two popular benchmarks demonstrate that VCAR substantially outperforms baseline methods solely relying on rationale supervision, especially on problems with high visual demands.
https://arxiv.org/abs/2404.14604
This paper presents a framework that can interpret humans' navigation commands containing temporal elements and directly translate their natural language instructions into robot motion planning. Central to our framework is utilizing Large Language Models (LLMs). To enhance the reliability of LLMs in the framework and improve user experience, we propose methods to resolve the ambiguity in natural language instructions and capture user preferences. The process begins with an ambiguity classifier, identifying potential uncertainties in the instructions. Ambiguous statements trigger a GPT-4-based mechanism that generates clarifying questions, incorporating user responses for disambiguation. Also, the framework assesses and records user preferences for non-ambiguous instructions, enhancing future interactions. The last part of this process is the translation of disambiguated instructions into a robot motion plan using Linear Temporal Logic. This paper details the development of this framework and the evaluation of its performance in various test scenarios.
https://arxiv.org/abs/2404.14547
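The final translation step described above can be illustrated for the common case of sequenced goals: an ordered list of regions to visit becomes a Linear Temporal Logic formula of nested "eventually" (F) operators. This nesting is one standard encoding of ordered visits; the region names are invented, and the paper's framework handles richer temporal patterns than this sketch.

```python
# Hedged sketch: ordered goals -> nested-F LTL formula for sequenced visits.
def goals_to_ltl(goals):
    """['a', 'b', 'c'] -> 'F(a & F(b & F(c)))'."""
    formula = f"F({goals[-1]})"
    for g in reversed(goals[:-1]):
        formula = f"F({g} & {formula})"
    return formula
```

A motion planner can then synthesize a trajectory satisfying the formula, e.g. visiting the kitchen before the door for `goals_to_ltl(["kitchen", "door"])`.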
Generative AI, such as OpenAI's GPT-4V large-language model, has rapidly entered mainstream discourse. Novel capabilities in image processing and natural-language communication may augment existing forecasting methods. Large language models further display potential to better communicate weather hazards in a style honed for diverse communities and different languages. This study evaluates GPT-4V's ability to interpret meteorological charts and communicate weather hazards appropriately to the user, despite challenges of hallucinations, where generative AI delivers coherent, confident, but incorrect responses. We assess GPT-4V's competence via its web interface ChatGPT in two tasks: (1) generating a severe-weather outlook from weather-chart analysis and conducting self-evaluation, revealing an outlook that corresponds well with a Storm Prediction Center human-issued forecast; and (2) producing hazard summaries in Spanish and English from weather charts. Responses in Spanish, however, resemble direct (not idiomatic) translations from English to Spanish, yielding poorly translated summaries that lose critical idiomatic precision required for optimal communication. Our findings advocate for cautious integration of tools like GPT-4V in meteorology, underscoring the necessity of human oversight and development of trustworthy, explainable AI.
https://arxiv.org/abs/2404.15166
The emergence of advanced neural networks has opened up new avenues for automated code generation from conceptual models, promising to enhance software development processes. This paper presents a preliminary evaluation of GPT-4-Vision, a state-of-the-art deep learning model, and its capabilities in transforming Unified Modeling Language (UML) class diagrams into fully functional Java class files. In our study, we used exported images of 18 class diagrams, comprising 10 single-class and 8 multi-class diagrams. We used 3 different prompts for each input and manually evaluated the results. We created a scoring system in which we scored whether each element shown in the diagram occurs in the generated source code. On average, the model was able to generate source code for 88% of the elements shown in the diagrams. Our results indicate that GPT-4-Vision exhibits proficiency in handling single-class UML diagrams, successfully transforming them into syntactically correct class files. However, for multi-class UML diagrams, the model's performance is weaker. In summary, further investigations are necessary to fully exploit the model's potential.
https://arxiv.org/abs/2404.14370
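The element-coverage scoring described above can be sketched as counting which diagram elements (class names, fields, methods) occur in the generated source and reporting the covered fraction. Matching raw identifiers with substring search is a simplification of the paper's manual evaluation, and the Java snippet is an invented example.

```python
# Hedged sketch of diagram-element coverage scoring for generated code.
def coverage(diagram_elements, source_code):
    """Fraction of diagram elements whose identifier appears in the code."""
    found = [e for e in diagram_elements if e in source_code]
    return len(found) / len(diagram_elements)

java = """
public class Account {
    private double balance;
    public void deposit(double amount) { balance += amount; }
}
"""
# 'withdraw' is missing from the generated class, so coverage is 3/4.
score = coverage(["Account", "balance", "deposit", "withdraw"], java)
```

A production version would parse the Java source rather than substring-match, to avoid counting an identifier that only appears inside a comment or string.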