This paper presents our approach to the EHRSQL-2024 shared task, which aims to develop a reliable Text-to-SQL system for electronic health records. We propose two approaches that leverage large language models (LLMs), via prompting and via fine-tuning, to generate EHRSQL queries. In both techniques, we concentrate on bridging the gap between the real-world knowledge on which LLMs are trained and the domain-specific knowledge required for the task. The paper reports the results of each approach individually, demonstrating that both achieve high execution accuracy. Additionally, we show that an ensemble approach further enhances generation reliability by reducing errors. This approach secured us 2nd place in the shared task competition. The methodologies outlined in this paper are designed to be transferable to domain-specific Text-to-SQL problems that emphasize both accuracy and reliability.
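The abstract does not spell out how the ensemble reduces errors; a common reliability heuristic for Text-to-SQL ensembles is execution-result majority voting, sketched below under that assumption (the function name and toy schema are ours, not the paper's):

```python
import sqlite3
from collections import Counter

def pick_by_execution_agreement(candidate_sqls, conn):
    # Run every candidate query; discard candidates that fail to execute,
    # then return a query whose result set matches the most common result.
    # Returning None means "abstain", trading coverage for reliability.
    results = {}
    for sql in candidate_sqls:
        try:
            results[sql] = tuple(map(tuple, conn.execute(sql).fetchall()))
        except sqlite3.Error:
            continue
    if not results:
        return None
    majority, _ = Counter(results.values()).most_common(1)[0]
    return next(sql for sql, res in results.items() if res == majority)
```

Queries that error out (a frequent failure mode of generated SQL) are filtered before voting, so a single malformed candidate cannot poison the ensemble.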
https://arxiv.org/abs/2405.08839
This paper presents Seal-Tools, a new tool-learning dataset of self-instruct, API-like tools. Seal-Tools not only offers a large number of tools but also includes instances that demonstrate the practical application of those tools. Seeking to generate data at scale while ensuring reliability, we propose a self-instruct method for generating tools and instances that allows precise control over the process. Moreover, Seal-Tools contains hard instances that call multiple tools to complete a job, some of which are nested tool callings. For precise and comprehensive evaluation, we use strict format control and design three metrics along different dimensions. Seal-Tools can therefore serve as a new benchmark to evaluate the tool-calling ability of LLMs. Finally, we evaluate several prevalent LLMs and our finetuned model on Seal-Tools. The results show that current systems are far from perfect. The code, data, and experiment results are available at this https URL.
https://arxiv.org/abs/2405.08355
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10%, respectively, when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly.
https://arxiv.org/abs/2405.08317
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction-finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model's performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
https://arxiv.org/abs/2405.08295
Misinformation about climate change is a complex societal issue requiring holistic, interdisciplinary solutions at the intersection of technology and psychology. One proposed solution is a "technocognitive" approach, involving the synthesis of psychological and computer science research. Psychological research has identified that interventions in response to misinformation require both fact-based (e.g., factual explanations) and technique-based (e.g., explanations of misleading techniques) content. However, little progress has been made on documenting and detecting fallacies in climate misinformation. In this study, we apply a previously developed critical thinking methodology for deconstructing climate misinformation in order to develop a dataset mapping different types of climate misinformation to reasoning fallacies. This dataset is used to train a model to detect fallacies in climate misinformation. Our study shows F1 scores that are 2.5 to 3.5 points better than previous works. The fallacies that are easiest to detect include fake experts and anecdotal arguments, while fallacies that require background knowledge, such as oversimplification, misrepresentation, and slothful induction, are relatively more difficult to detect. This research lays the groundwork for development of solutions where automatically detected climate misinformation can be countered with generative technique-based corrections.
https://arxiv.org/abs/2405.08254
Speech perception involves storing and integrating sequentially presented items. Recent work in cognitive neuroscience has identified temporal and contextual characteristics in humans' neural encoding of speech that may facilitate this temporal processing. In this study, we simulated similar analyses with representations extracted from a computational model that was trained on unlabelled speech with the learning objective of predicting upcoming acoustics. Our simulations revealed temporal dynamics similar to those in brain signals, implying that these properties can arise without linguistic knowledge. Another property shared between brains and the model is that the encoding patterns of phonemes support some degree of cross-context generalization. However, we found evidence that the effectiveness of these generalizations depends on the specific contexts, which suggests that this analysis alone is insufficient to support the presence of context-invariant encoding.
https://arxiv.org/abs/2405.08237
A large body of work in psycholinguistics has focused on the idea that online language comprehension can be shallow or `good enough': given constraints on time or available computation, comprehenders may form interpretations of their input that are plausible but inaccurate. However, this idea has not yet been linked with formal theories of computation under resource constraints. Here we use information theory to formulate a model of language comprehension as an optimal trade-off between accuracy and processing depth, formalized as bits of information extracted from the input, which increases with processing time. The model provides a measure of processing effort as the change in processing depth, which we link to EEG signals and reading times. We validate our theory against a large-scale dataset of garden path sentence reading times, and EEG experiments featuring N400, P600 and biphasic ERP effects. By quantifying the timecourse of language processing as it proceeds from shallow to deep, our model provides a unified framework to explain behavioral and neural signatures of language comprehension.
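A hedged sketch of how such an accuracy/depth trade-off is often written down in information-theoretic terms (our illustrative notation, not necessarily the paper's exact formalization): let $X$ be the input, $\hat{X}_t$ the comprehender's internal representation at processing time $t$, and $d(t)$ the processing depth in bits.

```latex
% Depth as bits of information extracted from the input,
% nondecreasing in processing time t:
d(t) = I(X; \hat{X}_t), \qquad d'(t) \ge 0.
% A rate-distortion-style trade-off between accuracy and depth,
% with D a distortion (inaccuracy) measure and \lambda the cost per bit:
\hat{X}_t = \arg\min_{\hat{X}} \; \mathbb{E}\big[ D(X, \hat{X}) \big] + \lambda \, I(X; \hat{X}).
% Processing effort over an interval as the change in depth:
E(t_1, t_2) = d(t_2) - d(t_1).
```

Under such a reading, shallow "good enough" interpretations correspond to small $t$ (few bits extracted), and the effort measure $E$ is what gets linked to EEG amplitudes and reading times.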
https://arxiv.org/abs/2405.08223
Recent advances in artificial intelligence for education leverage generative large language models, including using them to predict open-ended student responses rather than their correctness only. However, the black-box nature of these models limits the interpretability of the learned student knowledge representations. In this paper, we conduct a first exploration into interpreting latent student knowledge representations by presenting InfoOIRT, an Information-regularized Open-ended Item Response Theory model, which encourages the latent student knowledge states to be interpretable while being able to generate student-written code for open-ended programming questions. InfoOIRT maximizes the mutual information between a fixed subset of latent knowledge states enforced with simple prior distributions and generated student code, which encourages the model to learn disentangled representations of salient syntactic and semantic code features, including syntactic styles, mastery of programming skills, and code structures. Through experiments on a real-world programming education dataset, we show that InfoOIRT can both accurately generate student code and lead to interpretable student knowledge representations.
https://arxiv.org/abs/2405.08213
Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks rigorous evaluation of the impact of retrieval-augmented LLMs on different biomedical NLP tasks, which makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieved knowledge that is unlabeled, counterfactual, or diverse; such knowledge is common in the real world but not well studied in the biomedical domain. Finally, exploring the self-awareness ability of RAL systems is also crucial. In this paper, we therefore systematically investigate the impact of RALs on 5 biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs along four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess RALs' performance on different biomedical NLP tasks and establish four testbeds based on the aforementioned fundamental abilities. We then evaluate 3 representative LLMs with 3 different retrievers on the 5 tasks over 9 datasets.
https://arxiv.org/abs/2405.08151
Most Americans agree that misinformation, hate speech, and harassment are harmful and inadequately curbed on social media through current moderation practices. In this paper, we aim to understand the discursive strategies employed by people in response to harmful speech in news comments. We conducted a content analysis of more than 6,500 comment replies to trending news videos on YouTube and Twitter and identified seven distinct discursive objection strategies (Study 1). We examined the frequency of each strategy's occurrence in those 6,500 comment replies, as well as in a second sample of 2,004 replies (Study 2). Together, these studies show that people deploy a diversity of discursive strategies when objecting to speech, and that reputational attacks are the most common. The resulting classification scheme accounts for different theoretical approaches to expressing objections and offers a comprehensive perspective on grassroots efforts aimed at stopping offensive or problematic speech on social media.
https://arxiv.org/abs/2405.08142
We introduce Many-Shot Regurgitation (MSR) prompting, a new black-box membership inference attack framework for examining verbatim content reproduction in large language models (LLMs). MSR prompting involves dividing the input text into multiple segments and creating a single prompt that includes a series of faux conversation rounds between a user and a language model to elicit verbatim regurgitation. We apply MSR prompting to diverse text sources, including Wikipedia articles and open educational resources (OER) textbooks, which provide high-quality, factual content and are continuously updated over time. For each source, we curate two dataset types: one that LLMs were likely exposed to during training ($D_{\rm pre}$) and another consisting of documents published after the models' training cutoff dates ($D_{\rm post}$). To quantify the occurrence of verbatim matches, we employ the Longest Common Substring algorithm and count the frequency of matches at different length thresholds. We then use statistical measures such as Cliff's delta, Kolmogorov-Smirnov (KS) distance, and Kruskal-Wallis H test to determine whether the distribution of verbatim matches differs significantly between $D_{\rm pre}$ and $D_{\rm post}$. Our findings reveal a striking difference in the distribution of verbatim matches between $D_{\rm pre}$ and $D_{\rm post}$, with the frequency of verbatim reproduction being significantly higher when LLMs (e.g. GPT models and LLaMAs) are prompted with text from datasets they were likely trained on. For instance, when using GPT-3.5 on Wikipedia articles, we observe a substantial effect size (Cliff's delta $= -0.984$) and a large KS distance ($0.875$) between the distributions of $D_{\rm pre}$ and $D_{\rm post}$. Our results provide compelling evidence that LLMs are more prone to reproducing verbatim content when the input text is likely sourced from their training data.
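The two measurement ingredients named above, longest common substring matching and Cliff's delta, are standard and can be sketched directly (a minimal reference implementation, not the authors' code):

```python
def longest_common_substring_len(a: str, b: str) -> int:
    # Classic dynamic program: cur[j] holds the length of the common
    # substring ending at a[i-1] and b[j-1]; the maximum over all cells
    # is the longest-common-substring length.
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def cliffs_delta(xs, ys):
    # Cliff's delta: P(x > y) - P(x < y) over all pairs; ranges in [-1, 1],
    # with large |delta| indicating the two match-count distributions differ.
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))
```

A verbatim-match count at threshold L would then be the number of source segments whose LCS length with the model output is at least L, and the delta compares those counts between $D_{\rm pre}$ and $D_{\rm post}$.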
https://arxiv.org/abs/2405.08134
Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset, KET-QA, with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times and absolute improvements of 11.66% to 44.64% on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information in the traditional TableQA manner. However, even the best model achieves a 60.23% EM score, which still lags behind human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to further analyze the aspects in which the model can be improved. Project page: this https URL.
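The retriever half of the retriever-reasoner pipeline can be illustrated with a deliberately simple lexical-overlap ranker over KB triples (a toy stand-in of ours; the paper's retriever is a learned model):

```python
def retrieve_triples(question: str, triples, k: int = 2):
    # Rank (subject, relation, object) triples from the table's KB sub-graph
    # by word overlap with the question, keeping the top-k as evidence
    # for the downstream reasoner.
    q_tokens = set(question.lower().replace("?", "").split())
    def overlap(triple):
        return sum(w in q_tokens for w in " ".join(triple).lower().split())
    return sorted(triples, key=overlap, reverse=True)[:k]
```

In the actual pipeline the reasoner would then answer from the table plus the retrieved evidence; the point of retrieval is to shrink the "vast knowledge sub-graph" to a few pertinent facts.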
https://arxiv.org/abs/2405.08099
Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes over-rely on static medical question-answering benchmarks, falling short of the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image-and-dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce the diagnostic accuracy of doctor agents. The code and data for this work are publicly available at this https URL.
https://arxiv.org/abs/2405.07960
Many commercial and open-source models claim to detect machine-generated text with very high accuracy (99% or higher). However, very few of these detectors are evaluated on shared benchmark datasets and even when they are, the datasets used for evaluation are insufficiently challenging -- lacking variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our dataset and tools to encourage further exploration into detector robustness.
https://arxiv.org/abs/2405.07940
In this paper, we introduce EconLogicQA, a rigorous benchmark designed to assess the sequential reasoning capabilities of large language models (LLMs) within the intricate realms of economics, business, and supply chain management. Diverging from traditional benchmarks that predict subsequent events individually, EconLogicQA poses a more challenging task: it requires models to discern and sequence multiple interconnected events, capturing the complexity of economic logic. EconLogicQA comprises an array of multi-event scenarios derived from economic articles, which necessitate an insightful understanding of both temporal and logical event relationships. Through comprehensive evaluations, we show that EconLogicQA effectively gauges an LLM's proficiency in navigating the sequential complexities inherent in economic contexts. We provide a detailed description of the EconLogicQA dataset and present the outcomes of evaluating the benchmark across various leading-edge LLMs, thereby offering a thorough perspective on their sequential reasoning potential in economic contexts. Our benchmark dataset is available at this https URL.
https://arxiv.org/abs/2405.07938
The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and the testing of existing language models on the task of automatic text summarization. A distinguishing feature of the dataset is its multimodal data, which include texts, tables, and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available at this https URL.
https://arxiv.org/abs/2405.07886
Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
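The hypernetwork idea, predicting an embedding for any new vocabulary item from the token itself, can be illustrated with a toy stand-in that maps a token's byte histogram through a small MLP (our simplification; in ZeTT the hypernetwork is trained so that predicted embeddings preserve the frozen LM's behavior under the new tokenizer):

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN = 16, 32  # toy LM embedding size / hypernetwork width

# Toy hypernetwork weights (these would be trained in the real method).
W1 = rng.normal(0.0, 0.1, (256, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, EMB_DIM))

def predict_embedding(token: str) -> np.ndarray:
    # Featurize the token by its UTF-8 byte counts (a tokenizer-agnostic
    # input), then predict an embedding for this new vocabulary item.
    feats = np.zeros(256)
    for byte in token.encode("utf-8"):
        feats[byte] += 1.0
    return np.tanh(feats @ W1) @ W2

# Build an embedding matrix for an arbitrary new tokenizer's vocabulary:
new_vocab = ["Hello", "world", "<unk>"]
emb_matrix = np.stack([predict_embedding(t) for t in new_vocab])
```

Because the embedding for every token is a function of the token string alone, the same hypernetwork can be reused for any tokenizer swapped in on the fly, which is the zero-shot part of ZeTT.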
https://arxiv.org/abs/2405.07883
Rerunning a metric-based evaluation should be more straightforward, and results should be closer to the original, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors. However, as this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, such reruns do not always reproduce the original results, and can reveal errors in the reporting of the original work.
https://arxiv.org/abs/2405.07875
Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering semantic information from the fMRI signal. However, many existing efforts concentrate on decoding small vocabulary sets, leaving space for the exploration of open-vocabulary continuous text decoding. In this paper, we introduce a novel method, the \textbf{Brain Prompt GPT (BP-GPT)}. By using the brain representation extracted from the fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce a text-to-text baseline and align the fMRI prompt to the text prompt. By introducing the text-to-text baseline, our BP-GPT can extract a more robust brain prompt and promote the decoding by the pre-trained LLM. We evaluate our BP-GPT on the open-source auditory semantic decoding dataset and achieve a significant improvement of up to $4.61\%$ on METEOR and $2.43\%$ on BERTScore across all the subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representation as a prompt to further drive an LLM for auditory neural decoding is feasible and effective.
https://arxiv.org/abs/2405.07840
Based on the principles of information theory, measure theory, and theoretical computer science, we introduce a univariate signal deconvolution method with a wide range of applications to coding theory, particularly in zero-knowledge one-way communication channels, such as in deciphering messages from unknown generating sources about which no prior knowledge is available and to which no return message can be sent. Our multidimensional space reconstruction method from an arbitrary received signal is proven to be agnostic vis-a-vis the encoding-decoding scheme, computation model, programming language, formal theory, the computable (or semi-computable) method of approximation to algorithmic complexity, and any arbitrarily chosen (computable) probability measure of the events. The method derives from the principles of an approach to Artificial General Intelligence capable of building a general-purpose model of models independent of any arbitrarily assumed prior probability distribution. We argue that this optimal and universal method of decoding non-random data has applications to signal processing, causal deconvolution, topological and geometric properties encoding, cryptography, and bio- and technosignature detection.
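The "computable (or semi-computable) method of approximation to algorithmic complexity" can be illustrated with the crudest such proxy, compressed length (our illustrative choice; the framework described above is agnostic to which approximation is used):

```python
import zlib

def complexity(data: bytes) -> int:
    # Compressed length is a computable upper bound on algorithmic
    # (Kolmogorov) complexity, up to the compressor's constant overhead.
    return len(zlib.compress(data, 9))

def pick_reconstruction(candidates):
    # Among candidate reconstructions of a received signal, prefer the
    # least complex (most structured) one: a non-random message should
    # admit a shorter description than decoding artifacts do.
    return min(candidates, key=complexity)
```

A deconvolution search would enumerate candidate decodings (e.g., candidate dimensional reshapings of the raw signal) and keep the one whose description length is smallest, with no assumptions about the encoding scheme or the generating source.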
https://arxiv.org/abs/2405.07803