Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.
https://arxiv.org/abs/2405.00864
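To make the workflow above concrete, here is a minimal sketch of how an educator-in-the-loop tool might prompt an LLM for a draft stem and misconception-based distractors. The `generate` callable, the prompt wording, and the JSON schema are illustrative assumptions, not the paper's implementation.

```python
import json
from typing import Callable

def draft_mcq(generate: Callable[[str], str], topic: str, misconception: str) -> dict:
    """Ask an LLM for a draft math MCQ; an educator reviews and edits the result.

    `generate` is any text-completion callable (e.g., a hosted LLM API); the
    prompt and the JSON output format are illustrative assumptions.
    """
    prompt = (
        f"Write one multiple-choice math question about {topic}.\n"
        "Include exactly one correct answer and three distractors, where each "
        f"distractor reflects this student misconception: {misconception}.\n"
        'Respond as JSON: {"stem": ..., "correct": ..., "distractors": [...]}'
    )
    draft = json.loads(generate(prompt))
    # The educator, not the model, decides whether the distractors are plausible.
    draft["needs_review"] = True
    return draft
```

The key design point, consistent with the pilot-study findings, is that the model only drafts: the educator validates the stem and, especially, the distractors.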
The prevalence of unwarranted beliefs, spanning pseudoscience, logical fallacies, and conspiracy theories, presents substantial societal hurdles and the risk of disseminating misinformation. Utilizing established psychometric assessments, this study explores the capabilities of large language models (LLMs) vis-a-vis the average human in detecting prevalent logical pitfalls. We undertake a philosophical inquiry, juxtaposing the rationality of humans against that of LLMs. Furthermore, we propose methodologies for harnessing LLMs to counter misconceptions, drawing upon psychological models of persuasion such as cognitive dissonance theory and elaboration likelihood theory. Through this endeavor, we highlight the potential of LLMs as personalized misinformation debunking agents.
https://arxiv.org/abs/2405.00843
In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp policies achieve grasping success rates of 97.14% for single objects, and 87.43% and 83.33% for mixed clutter scenarios of Levels 1-2 and Levels 3-4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target-specific picking, pushing the boundaries of intelligent robotic systems.
https://arxiv.org/abs/2405.00841
We propose WIBA, a novel framework and suite of methods that enable the comprehensive understanding of "What Is Being Argued" across contexts. Our approach develops a comprehensive framework that detects: (a) the existence, (b) the topic, and (c) the stance of an argument, correctly accounting for the logical dependence among the three tasks. Our algorithm leverages the fine-tuning and prompt-engineering of Large Language Models. We evaluate our approach and show that it performs well in all the three capabilities. First, we develop and release an Argument Detection model that can classify a piece of text as an argument with an F1 score between 79% and 86% on three different benchmark datasets. Second, we release a language model that can identify the topic being argued in a sentence, be it implicit or explicit, with an average similarity score of 71%, outperforming current naive methods by nearly 40%. Finally, we develop a method for Argument Stance Classification, and evaluate the capability of our approach, showing it achieves a classification F1 score between 71% and 78% across three diverse benchmark datasets. Our evaluation demonstrates that WIBA allows the comprehensive understanding of What Is Being Argued in large corpora across diverse contexts, which is of core interest to many applications in linguistics, communication, and social and computer science. To facilitate accessibility to the advancements outlined in this work, we release WIBA as a free open access platform (wiba.dev).
https://arxiv.org/abs/2405.00828
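The three WIBA capabilities are logically dependent: topic and stance are only meaningful once a text is known to contain an argument at all. A minimal sketch of that dependency, with the three fine-tuned/prompted models abstracted as callables (an assumption for illustration, not the wiba.dev API):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ArgumentAnalysis:
    is_argument: bool
    topic: Optional[str]
    stance: Optional[str]  # e.g., "favor" or "against"

def analyze(text: str,
            detect: Callable[[str], bool],
            extract_topic: Callable[[str], str],
            classify_stance: Callable[[str, str], str]) -> ArgumentAnalysis:
    """Chain the three WIBA-style tasks, respecting their logical dependence:
    topic extraction and stance classification run only if the detector says
    the text is an argument."""
    if not detect(text):
        return ArgumentAnalysis(False, None, None)
    topic = extract_topic(text)
    stance = classify_stance(text, topic)
    return ArgumentAnalysis(True, topic, stance)
```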
Customer service is how companies interface with their customers. It can contribute heavily towards the overall customer satisfaction. However, high-quality service can become expensive, creating an incentive to make it as cost efficient as possible and prompting most companies to utilize AI-powered assistants, or "chat bots". On the other hand, human-to-human interaction is still desired by customers, especially when it comes to complex scenarios such as disputes and sensitive topics like bill payment. This raises the bar for customer service agents. They need to accurately understand the customer's question or concern, identify a solution that is acceptable yet feasible (and within the company's policy), all while handling multiple conversations at once. In this work, we introduce "Ask Me Anything" (AMA) as an add-on feature to an agent-facing customer service interface. AMA allows agents to ask questions to a large language model (LLM) on demand, as they are handling customer conversations -- the LLM provides accurate responses in real-time, reducing the amount of context switching the agent needs. In our internal experiments, we find that agents using AMA versus a traditional search experience spend approximately 10% fewer seconds per conversation containing a search, translating to millions of dollars of savings annually. Agents that used the AMA feature provided positive feedback nearly 80% of the time, demonstrating its usefulness as an AI-assisted feature for customer care.
https://arxiv.org/abs/2405.00801
Emerging multi-model workloads with heavy models, such as recent large language models, have significantly increased the compute and memory demands on hardware. To address these increasing demands, designing a scalable hardware architecture has become a key problem. Among recent solutions, the 2.5D silicon interposer multi-chip module (MCM)-based AI accelerator has been actively explored as a promising scalable solution due to its significant benefits in low engineering cost and composability. However, previous MCM accelerators are based on homogeneous architectures with fixed dataflow, which encounter major challenges from highly heterogeneous multi-model workloads due to their limited workload adaptivity. Therefore, in this work, we explore the opportunity in heterogeneous dataflow MCM AI accelerators. We identify that scheduling multi-model workloads on a heterogeneous dataflow MCM AI accelerator is an important and challenging problem due to its significance and scale, which reaches O(10^18) even for a single-model case on 6x6 chiplets. We develop a set of heuristics to navigate the huge scheduling space and codify them into a scheduler with advanced techniques such as inter-chiplet pipelining. Our evaluation on ten multi-model workload scenarios for datacenter multitenancy and AR/VR use cases shows the efficacy of our approach, achieving on average 35.3% and 31.4% less energy-delay product (EDP) for the respective application settings compared to homogeneous baselines.
https://arxiv.org/abs/2405.00790
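To give a feel for why the scheduling space is so large, the toy snippet below counts naive layer-to-chiplet assignments on a 6x6 package and shows the kind of greedy heuristic a scheduler can layer on top. The paper's actual space enumeration and heuristics are more involved, so treat this purely as an assumption-laden illustration.

```python
NUM_CHIPLETS = 6 * 6  # a 6x6 MCM package

def assignment_space(num_layers: int) -> float:
    """Size of the naive layer-to-chiplet assignment space alone; ordering,
    tiling, and pipelining choices (which the paper also counts) are ignored."""
    return float(NUM_CHIPLETS) ** num_layers

print(f"{assignment_space(12):.2e}")  # already ~4.7e18 placements

def greedy_place(task_costs, chiplet_throughput):
    """Toy list-scheduling heuristic: put each independent task on the chiplet
    that would finish it earliest. Real schedulers add dataflow-aware cost
    models and inter-chiplet pipelining on top of ideas like this."""
    finish = [0.0] * len(chiplet_throughput)
    placement = []
    for cost in task_costs:
        c = min(range(len(finish)),
                key=lambda i: finish[i] + cost / chiplet_throughput[i])
        finish[c] += cost / chiplet_throughput[c]
        placement.append(c)
    return placement, max(finish)
```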
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed \textit{Self-Play Preference Optimization} (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset, without any prompt augmentation, and leveraging a pre-trained preference model, PairRM, with only 0.4B parameters, SPPO obtains a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
https://arxiv.org/abs/2405.00675
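As a rough, non-authoritative sketch of the kind of update SPPO performs (see the paper for the exact objective), one iteration can be viewed as regressing the log-ratio of the new policy to the current policy toward a target proportional to each response's estimated win probability against the current policy:

```python
import torch

def sppo_style_loss(logp_theta: torch.Tensor,
                    logp_current: torch.Tensor,
                    win_prob: torch.Tensor,
                    eta: float = 1.0) -> torch.Tensor:
    """Simplified per-response objective in the spirit of SPPO: push
    log pi_theta(y|x) - log pi_t(y|x) toward eta * (P(y beats pi_t | x) - 1/2),
    where the win probability is estimated by a preference model such as PairRM.
    All tensors have shape (batch,); eta is a hyperparameter."""
    target = eta * (win_prob - 0.5)
    log_ratio = logp_theta - logp_current
    return ((log_ratio - target) ** 2).mean()

# Hypothetical usage with precomputed sequence log-probabilities:
loss = sppo_style_loss(torch.tensor([-12.3, -8.1]),
                       torch.tensor([-12.0, -8.4]),
                       torch.tensor([0.7, 0.2]))
```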
This study presents a targeted model editing analysis focused on the latest large language model, Llama-3. We explore the efficacy of popular model editing techniques - ROME, MEMIT, and EMMET - which are designed for precise layer interventions. We identify the most effective layers for targeted edits through an evaluation that encompasses up to 4096 edits across three distinct strategies: sequential editing, batch editing, and a hybrid approach we call sequential-batch editing. Our findings indicate that increasing edit batch sizes may degrade model performance more significantly than using smaller edit batches sequentially for an equal number of edits. With this, we argue that sequential model editing is an important component for scaling model editing methods, and future research should focus on methods that combine both batched and sequential editing. This observation suggests a potential limitation in current model editing methods, which push towards bigger edit batch sizes, and we hope it paves the way for future investigations into optimizing batch sizes and model editing performance.
https://arxiv.org/abs/2405.00664
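The three editing strategies differ only in how a fixed set of edits is chunked. A minimal sketch, with `edit_batch` standing in for an actual editing call (e.g., MEMIT or EMMET from the authors' code, which is not reproduced here):

```python
from typing import Callable, Sequence

def apply_edits(model, edits: Sequence, batch_size: int,
                edit_batch: Callable):
    """Apply `edits` in chunks of `batch_size`. `edit_batch(model, chunk)` is a
    placeholder for a real editing routine and returns the edited model."""
    for i in range(0, len(edits), batch_size):
        model = edit_batch(model, edits[i:i + batch_size])
    return model

# batch_size = len(edits) -> pure batch editing
# batch_size = 1          -> pure sequential editing
# anything in between     -> the sequential-batch hybrid studied in the paper
```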
Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
https://arxiv.org/abs/2405.00632
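A simple way to run this kind of analysis on one's own model pair is to compare, example by example, the probability that the full-precision and quantized models assign to the true label, together with a standard calibration metric. The sketch below is generic and not tied to the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def true_label_confidence(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-example softmax probability assigned to the correct label.
    `logits` has shape (batch, classes), `labels` shape (batch,)."""
    return F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)

def expected_calibration_error(conf: torch.Tensor, correct: torch.Tensor,
                               n_bins: int = 10) -> float:
    """Standard ECE over equal-width confidence bins; `conf` is the predicted-class
    probability and `correct` a boolean tensor of the same shape."""
    ece, edges = 0.0, torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.float().mean() * (correct[mask].float().mean() - conf[mask].mean()).abs()
    return float(ece)

# Comparing true_label_confidence for the full-precision model and its 4-bit
# GPTQ version on the same examples shows which samples lose the most confidence.
```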
Widely deployed large language models (LLMs) can produce convincing yet incorrect outputs, potentially misleading users who may rely on them as if they were correct. To reduce such overreliance, there have been calls for LLMs to communicate their uncertainty to end users. However, there has been little empirical work examining how users perceive and act upon LLMs' expressions of uncertainty. We explore this question through a large-scale, pre-registered, human-subject experiment (N=404) in which participants answer medical questions with or without access to responses from a fictional LLM-infused search engine. Using both behavioral and self-reported measures, we examine how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance. We find that first-person expressions (e.g., "I'm not sure, but...") decrease participants' confidence in the system and tendency to agree with the system's answers, while increasing participants' accuracy. An exploratory analysis suggests that this increase can be attributed to reduced (but not fully eliminated) overreliance on incorrect answers. While we observe similar effects for uncertainty expressed from a general perspective (e.g., "It's not clear, but..."), these effects are weaker and not statistically significant. Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters. This highlights the importance of user testing before deploying LLMs at scale.
https://arxiv.org/abs/2405.00623
Causal reasoning is viewed as crucial for achieving human-level machine intelligence. Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning. In this work, we introduce Causal evaluation of Language Models (CaLM), which, to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. First, we propose the CaLM framework, which establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results). This taxonomy defines a broad evaluation design space while systematically selecting criteria and priorities. Second, we compose the CaLM dataset, comprising 126,334 data samples, to provide curated sets of causal targets, adaptations, metrics, and errors, offering extensive coverage for diverse research pursuits. Third, we conduct an extensive evaluation of 28 leading language models on a core set of 92 causal targets, 9 adaptations, 7 metrics, and 12 error types. Fourth, we perform detailed analyses of the evaluation results across various dimensions (e.g., adaptation, scale). Fifth, we present 50 high-level empirical findings across 9 dimensions (e.g., model), providing valuable guidance for future language model development. Finally, we develop a multifaceted platform, including a website, leaderboards, datasets, and toolkits, to support scalable and adaptable assessments. We envision CaLM as an ever-evolving benchmark for the community, systematically updated with new causal targets, adaptations, models, metrics, and error types to reflect ongoing research advancements. Project website is at this https URL.
https://arxiv.org/abs/2405.00622
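The four-module taxonomy is easy to picture as a record type: every evaluation run in CaLM is a choice of causal target, adaptation, metric, and error-analysis category. The field values below are illustrative placeholders, not an enumeration of the benchmark's actual option sets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaLMEvalPoint:
    """One cell of the CaLM design space: what to evaluate, how to obtain the
    result, how to score it, and how to analyse failures."""
    causal_target: str   # e.g., "counterfactual reasoning"
    adaptation: str      # e.g., "zero-shot prompting"
    metric: str          # e.g., "accuracy"
    error_type: str      # e.g., "contradiction"

point = CaLMEvalPoint("counterfactual reasoning", "zero-shot prompting",
                      "accuracy", "contradiction")
```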
Large language models (LLMs) with their strong zero-shot topic extraction capabilities offer an alternative to probabilistic topic modelling and closed-set topic classification approaches. As zero-shot topic extractors, LLMs are expected to understand human instructions to generate relevant and non-hallucinated topics based on the given documents. However, LLM-based topic modelling approaches often struggle to generate topics that adhere to the granularity specified in human instructions, often resulting in many near-duplicate topics. Furthermore, methods for addressing hallucinated topics generated by LLMs have not yet been investigated. In this paper, we focus on addressing the issues of topic granularity and hallucination for better LLM-based topic modelling. To this end, we introduce a novel approach that leverages Direct Preference Optimisation (DPO) to fine-tune open-source LLMs, such as Mistral-7B. Our approach does not rely on traditional human annotation to rank preferred answers but employs a reconstruction pipeline to modify raw topics generated by LLMs, thus enabling a fast and efficient training and inference framework. Comparative experiments show that our fine-tuning approach not only significantly improves the LLM's capability to produce more coherent, relevant, and precise topics, but also reduces the number of hallucinated topics.
https://arxiv.org/abs/2405.00611
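For reference, the DPO objective the approach builds on is the standard sigmoid loss over (preferred, rejected) pairs; in this work the pairs come from the reconstruction pipeline rather than human annotators. A minimal PyTorch version:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. Inputs are sequence log-probabilities of the
    preferred (w) and rejected (l) outputs under the policy and the frozen
    reference model, each of shape (batch,)."""
    pref_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * pref_margin).mean()
```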
Automatic grading and feedback have long been studied using traditional machine learning and deep learning techniques based on language models. With the recent accessibility of high-performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning and additional specific adjustments to enhance their performance for such tasks. To address these issues, Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements in model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing classification or regression heads, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for our tasks. The results show that grade prediction via fine-tuned LLMs is highly accurate, achieving less than 3% error in grade percentage on average. For feedback generation, fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models, achieving high similarity with subject-matter-expert feedback both quantitatively, in terms of BLEU and ROUGE scores, and qualitatively. The findings from this study provide important insights into using quantization approaches to fine-tune LLMs for various downstream tasks, such as automatic short-answer scoring and feedback generation, at comparatively lower cost and latency.
https://arxiv.org/abs/2405.00602
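A representative (but not the authors' exact) QLoRA-style setup for continuous grade prediction looks roughly like the following, using Hugging Face `transformers` and `peft`; the checkpoint name and hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative QLoRA-style setup for a regression (scoring) head.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-13b-hf", num_labels=1, quantization_config=bnb)
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
# With num_labels=1 and float labels, the Hugging Face Trainer applies an MSE
# loss, i.e. the model predicts a continuous grade for each answer.
```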
Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine whether a sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at this https URL.
https://arxiv.org/abs/2405.00588
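The paper's sentence-level score is derived from pretraining-data statistics; its exact form is not reproduced here. As an assumption-laden illustration, a filter of this kind could score each content word by its pointwise mutual information with gendered tokens and keep sentences whose words all score near zero:

```python
import math
from collections import Counter

def gender_pmi(word: str, cooc: Counter, unigrams: Counter, total: int,
               gendered=("he", "she", "him", "her", "his", "hers")) -> float:
    """Maximum pointwise mutual information between `word` and a small set of
    gendered tokens, estimated from corpus co-occurrence counts. A sentence
    whose content words all score near zero would count as 'minimally
    gender-associated' under a USE-like criterion (illustrative only, not the
    paper's exact statistic)."""
    scores = []
    for g in gendered:
        joint = cooc.get((word, g), 0)
        if joint and unigrams[word] and unigrams[g]:
            scores.append(math.log((joint * total) / (unigrams[word] * unigrams[g])))
    return max(scores, default=0.0)
```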
Large language model alignment is widely used and studied to prevent LLMs from producing unhelpful or harmful responses. However, the lengthy training process and predefined preference bias hinder adaptation to diverse online human preferences. To this end, this paper proposes an alignment framework, called Reinforcement Learning with Human Behavior (RLHB), to align LLMs by directly leveraging real online human behaviors. Adopting a generative adversarial framework, the generator is trained to respond in line with expected human behavior, while the discriminator tries to verify whether the triplets of query, response, and human behavior come from real online environments. Behavior modeling in natural-language form and the multi-model joint training mechanism enable active and sustainable online alignment. Experimental results from both human and automatic evaluations confirm the effectiveness of our proposed method.
https://arxiv.org/abs/2405.00578
Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have been more focused on short sequential video emotion analysis while overlooking long sequential video. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address the aforementioned limitations, in this paper, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD by collecting and processing the sequences of athletes' post-match interviews. In addition to providing annotations of the overall emotional state of each video, we also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate the Multimodal Large Language Models (MLLMs) with de-identification signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, even better performance than the supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.
https://arxiv.org/abs/2405.00574
Recently, many works have proposed various financial large language models (FinLLMs) by pre-training from scratch or fine-tuning open-sourced LLMs on financial corpora. However, existing FinLLMs exhibit unsatisfactory performance in understanding financial text when numeric variables are involved in questions. In this paper, we propose a novel LLM, called numeric-sensitive large language model (NumLLM), for Chinese finance. We first construct a financial corpus from financial textbooks, which is essential for improving the numeric capability of LLMs during fine-tuning. After that, we train two individual low-rank adaptation (LoRA) modules by fine-tuning on our constructed financial corpus. One module adapts general-purpose LLMs to the financial domain, and the other enhances the ability of NumLLM to understand financial text with numeric variables. Lastly, we merge the two LoRA modules into the foundation model to obtain NumLLM for inference. Experiments on a financial question-answering benchmark show that NumLLM boosts the performance of the foundation model and achieves the best overall performance compared to all baselines, on both numeric and non-numeric questions.
https://arxiv.org/abs/2405.00566
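Merging two LoRA modules into a foundation model amounts, per weight matrix, to folding two low-rank updates into the base weights. A plain-numpy sketch of that arithmetic (the paper's actual merge procedure and scaling may differ):

```python
import numpy as np

def merge_two_loras(w0: np.ndarray,
                    a1: np.ndarray, b1: np.ndarray,
                    a2: np.ndarray, b2: np.ndarray,
                    s1: float = 1.0, s2: float = 1.0) -> np.ndarray:
    """W = W0 + s1 * B1 @ A1 + s2 * B2 @ A2: fold a domain-adaptation adapter
    and a numeric-capability adapter into one base weight matrix."""
    return w0 + s1 * (b1 @ a1) + s2 * (b2 @ a2)

d, r = 8, 2
w0 = np.zeros((d, d))
merged = merge_two_loras(w0,
                         np.random.randn(r, d), np.random.randn(d, r),
                         np.random.randn(r, d), np.random.randn(d, r))
```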
As the capabilities of large language models (LLMs) have expanded dramatically, aligning these models with human values presents a significant challenge, posing potential risks during deployment. Traditional alignment strategies rely heavily on human intervention, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), or on the self-alignment capacities of LLMs, which usually require a strong LLM's emergent ability to improve its original bad answer. To address these challenges, we propose a novel self-alignment method that utilizes a Chain of Thought (CoT) approach, termed AlignCoT. This method encompasses stages of Question Analysis, Answer Guidance, and Safe Answer production. It is designed to enable LLMs to generate high-quality, safe responses throughout various stages of their development. Furthermore, we introduce the Mixture of insighTful Experts (MoTE) architecture, which applies the mixture of experts to enhance each component of the AlignCoT process, markedly increasing alignment efficiency. The MoTE approach not only outperforms existing methods in aligning LLMs with human values but also highlights the benefits of using self-generated data, revealing the dual benefits of improved alignment and training efficiency.
https://arxiv.org/abs/2405.00557
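A minimal sketch of the three AlignCoT stages as chained prompts, with `generate` standing in for any LLM call; the prompt templates are illustrative assumptions rather than the authors' wording:

```python
from typing import Callable

def align_cot(generate: Callable[[str], str], question: str) -> str:
    """Question Analysis -> Answer Guidance -> Safe Answer, in the spirit of
    AlignCoT. `generate` is any LLM text-generation callable."""
    analysis = generate(
        "Analyse the following question: what is being asked, and what safety "
        f"or value concerns does it raise?\nQuestion: {question}")
    guidance = generate(
        "Given this analysis, outline how a helpful and harmless answer should "
        f"be structured.\nAnalysis: {analysis}")
    return generate(
        "Following the guidance, write the final safe answer to the question.\n"
        f"Question: {question}\nGuidance: {guidance}")
```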
We present a novel approach for long-term human trajectory prediction, which is essential for long-horizon robot planning in human-populated environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood (NLL) and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged baselines for a time horizon of 60s.
https://arxiv.org/abs/2405.00552
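The continuous-time Markov chain grounding has a compact closed form: with generator matrix Q, the state distribution after time t is p(t) = p0·exp(Qt). A toy example over three hypothetical interaction states (the states and rates below are made up for illustration):

```python
import numpy as np
from scipy.linalg import expm

# Toy continuous-time Markov chain over three interaction states
# (e.g., "at desk", "at coffee machine", "at door"); rows of Q sum to 0.
Q = np.array([[-0.10,  0.07,  0.03],
              [ 0.20, -0.30,  0.10],
              [ 0.05,  0.15, -0.20]])

def state_distribution(p0: np.ndarray, t: float) -> np.ndarray:
    """p(t) = p0 @ expm(Q t): the probability of each interaction state after
    t seconds, which can then be mapped to spatial distributions via the
    locations those states occupy in the scene graph."""
    return p0 @ expm(Q * t)

print(state_distribution(np.array([1.0, 0.0, 0.0]), t=60.0))
```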
This paper introduces CookingSense, a descriptive collection of knowledge assertions in the culinary domain extracted from various sources, including web data, scientific papers, and recipes, covering a broad range of aspects. CookingSense is constructed through a series of dictionary-based filtering and language model-based semantic filtering techniques, which results in a rich knowledge base of multidisciplinary food-related assertions. Additionally, we present FoodBench, a novel benchmark to evaluate culinary decision support systems. In evaluations with FoodBench, we empirically show that CookingSense improves the performance of retrieval-augmented language models. We also validate the quality and variety of assertions in CookingSense through qualitative analysis.
https://arxiv.org/abs/2405.00523
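The first, dictionary-based filtering stage can be as simple as a lexicon intersection; the subsequent language-model-based semantic filter is where most of the quality comes from. A toy sketch of the lexical stage only (the lexicon here is made up, not the paper's):

```python
CULINARY_TERMS = {"bake", "simmer", "umami", "fermentation", "knead", "kneading"}

def dictionary_filter(assertions, lexicon=CULINARY_TERMS):
    """First-pass filter: keep assertions mentioning at least one lexicon term.
    A second, language-model-based semantic filter (not shown) would then drop
    assertions that are on-topic lexically but not meaningful."""
    kept = []
    for text in assertions:
        tokens = {t.strip(".,").lower() for t in text.split()}
        if tokens & lexicon:
            kept.append(text)
    return kept

print(dictionary_filter(["Kneading develops gluten.", "The car needs new tires."]))
```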