Searching and manipulating dependency graphs can be a time-consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching for relations and attributes over natural text.
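A rough sense of what Semgrex-style search does can be conveyed with a toy matcher. The sentence, tags, and `match` helper below are invented for illustration and do not reflect Semgrex's actual API; real Semgrex patterns look like `{pos:/VB.*/} >nsubj {pos:/NN.*/}`.

```python
# Toy illustration of Semgrex-style search over a dependency graph.
import re

# A parse as (governor_index, relation, dependent_index) triples, with a
# token list of (word, pos) pairs -- a hypothetical example sentence.
tokens = [("dogs", "NNS"), ("chase", "VBP"), ("cats", "NNS")]
edges = [(1, "nsubj", 0), (1, "obj", 2)]

def match(gov_pos, rel, dep_pos):
    """Return (governor, dependent) word pairs where the governor's POS
    matches gov_pos, the edge label equals rel, and the dependent's POS
    matches dep_pos -- a stand-in for a Semgrex node/relation pattern."""
    hits = []
    for g, r, d in edges:
        if (r == rel and re.fullmatch(gov_pos, tokens[g][1])
                and re.fullmatch(dep_pos, tokens[d][1])):
            hits.append((tokens[g][0], tokens[d][0]))
    return hits

print(match(r"VB.*", "nsubj", r"NN.*"))  # [('chase', 'dogs')]
```

Ssurgeon would then rewrite the matched subgraph in place; the same pattern language selects the nodes to edit.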
https://arxiv.org/abs/2404.16250
Objectives: This study aims to systematically review the literature on the computational processing of the language of pain, whether generated by patients or physicians, identifying current trends and challenges. Methods: Following the PRISMA guidelines, a comprehensive literature search was conducted to select relevant studies on the computational processing of the language of pain and answer pre-defined research questions. Data extraction and synthesis were performed to categorize selected studies according to their primary purpose and outcome, patient and pain population, textual data, computational methodology, and outcome targets. Results: Physician-generated language of pain, specifically from clinical notes, was the most used data. Tasks included patient diagnosis and triaging, identification of pain mentions, treatment response prediction, biomedical entity extraction, correlation of linguistic features with clinical states, and lexico-semantic analysis of pain narratives. Only one study included previous linguistic knowledge on pain utterances in their experimental setup. Most studies targeted their outcomes for physicians, either directly as clinical tools or as indirect knowledge. The least targeted stage of clinical pain care was self-management, in which patients are most involved. The least studied dimensions of pain were affective and sociocultural. Only two studies measured how physician performance on clinical tasks improved with the inclusion of the proposed algorithm. Discussion: This study found that future research should focus on analyzing patient-generated language of pain, developing patient-centered resources for self-management and patient-empowerment, exploring affective and sociocultural aspects of pain, and measuring improvements in physician performance when aided by the proposed tools.
https://arxiv.org/abs/2404.16226
Objective: Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants. Although leveraging electronic health records (EHR) for recruitment has gained popularity, the complex nature of unstructured medical texts presents challenges in efficiently identifying participants. Natural Language Processing (NLP) techniques have emerged as a solution, with a recent focus on transformer models. In this study, we aimed to evaluate the performance of a prompt-based large language model on the cohort selection task from unstructured medical notes collected in the EHR. Methods: To process the medical records, we selected the sentences of each record most relevant to the eligibility criteria needed for the trial. The SNOMED CT concepts related to each eligibility criterion were collected. Medical records were also annotated with MedCAT based on the SNOMED CT ontology. Annotated sentences containing concepts that matched the criteria-relevant terms were extracted. A prompt-based large language model (a Generative Pre-trained Transformer (GPT) in this study) was then used with the extracted sentences as the training set. To assess its effectiveness, we evaluated the model's performance using the dataset from the 2018 n2c2 challenge, which aimed to classify the medical records of 311 patients based on 13 eligibility criteria through NLP techniques. Results: Our proposed model achieved overall micro and macro F-measures of 0.9061 and 0.8060, which were among the highest scores achieved in experiments with this dataset. Conclusion: The application of a prompt-based large language model in this study to classify patients based on eligibility criteria received promising scores. In addition, we proposed an extractive summarization method with the aid of the SNOMED CT ontology that can also be applied to other medical texts.
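The extractive step described above can be sketched minimally: keep only the sentences that mention a term linked to an eligibility criterion. The term list and clinical note below are invented examples; the paper derives its term lists from SNOMED CT concepts and MedCAT annotations rather than a hand-written set.

```python
# Hedged sketch: criterion-relevant sentence extraction from a note.
# The terms and note text are hypothetical, for illustration only.
criterion_terms = {"creatinine", "dialysis", "renal"}

note = ("Patient denies chest pain. Serum creatinine elevated at 2.1. "
        "No history of dialysis. Ambulating without assistance.")

selected = [s.strip() for s in note.split(". ")
            if any(t in s.lower() for t in criterion_terms)]
print(selected)
```

The selected sentences would then be passed to the prompt-based model instead of the full record, shrinking the input while keeping the criterion-bearing content.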
https://arxiv.org/abs/2404.16198
We present a diachronic acoustic analysis of the voice of 1023 speakers from French media archives. The speakers are spread across 32 categories based on four periods (years 1955/56, 1975/76, 1995/96, 2015/16), four age groups (20-35; 36-50; 51-65; >65), and two genders. The fundamental frequency ($F_0$) and the first four formants (F1-4) were estimated. Procedures used to ensure the quality of these estimations on heterogeneous data are described. From each speaker's $F_0$ distribution, the base-$F_0$ value was calculated to estimate the register. Average vocal tract length was estimated from formant frequencies. Base-$F_0$ and vocal tract length were fit by linear mixed models to evaluate how they may have changed across time periods and genders, corrected for age effects. Results show an effect of the period with a tendency to lower voices, independently of gender. A lowering of pitch is observed with age for female but not male speakers.
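One standard way to estimate vocal tract length (VTL) from formants, which may differ from the paper's exact procedure, is the uniform-tube approximation $F_k = (2k-1)c/(4L)$, inverted and averaged over the measured formants. The formant values below are illustrative, not the paper's data.

```python
# Sketch: VTL estimation from formant frequencies under the
# uniform-tube (quarter-wavelength) approximation.
C = 35000.0  # speed of sound in warm, moist air, in cm/s

def vtl_from_formants(formants_hz):
    """Average the per-formant length estimates L_k = (2k-1)c / (4 F_k)."""
    estimates = [(2 * k - 1) * C / (4.0 * f)
                 for k, f in enumerate(formants_hz, start=1)]
    return sum(estimates) / len(estimates)

# A neutral adult vocal tract of ~17.5 cm has formants near
# 500, 1500, 2500, 3500 Hz:
print(round(vtl_from_formants([500, 1500, 2500, 3500]), 2))  # 17.5
```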
https://arxiv.org/abs/2404.16104
Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned language models. These triggers are believed to be universally transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate trigger transfer amongst 13 open models and observe inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned language models.
https://arxiv.org/abs/2404.16020
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.
https://arxiv.org/abs/2404.16019
Large language models (LLMs) are highly capable at many tasks but can sometimes generate unreliable or inaccurate outputs. To tackle this issue, this paper studies the problem of uncertainty estimation and calibration for LLMs. We begin by formulating the uncertainty estimation problem for LLMs and then propose a supervised approach that takes advantage of labeled datasets to estimate the uncertainty of the LLMs' responses. Based on this formulation, we illustrate the difference between uncertainty estimation for LLMs and that for standard ML models and explain why the hidden activations of LLMs contain uncertainty information. Our designed approach effectively demonstrates the benefits of utilizing hidden activations for enhanced uncertainty estimation across various tasks and shows robust transferability in out-of-distribution settings. Moreover, we distinguish the uncertainty estimation task from the uncertainty calibration task and show that a better uncertainty estimation model leads to better calibration performance. In practice, our method is easy to implement and is adaptable to different levels of model transparency, including black box, grey box, and white box, each demonstrating strong performance based on the accessibility of the LLM's internal mechanisms.
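The supervised-probe idea can be sketched as follows: fit a classifier on a model's hidden activations to predict whether its response is correct, then read the predicted probability as an uncertainty estimate. The data here is synthetic and the probe is plain logistic regression; the paper's choice of layer, model, and estimator is not reproduced.

```python
# Hedged sketch: a supervised uncertainty probe on (synthetic) hidden states.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # hidden-state dimension (toy)
w_true = rng.normal(size=d)              # synthetic "correctness" direction
X = rng.normal(size=(500, d))            # stand-in hidden activations
y = (X @ w_true > 0).astype(float)       # 1 = the response was correct

# Plain logistic regression by gradient descent (no external deps).
w = np.zeros(d)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

confidence = 1.0 / (1.0 + np.exp(-(X @ w)))   # per-response P(correct)
accuracy = ((confidence > 0.5) == y).mean()
print(accuracy)  # high on this linearly separable toy data
```

Calibration would then be assessed separately, e.g. by comparing binned `confidence` values against empirical accuracy, which is the estimation-vs-calibration distinction the abstract draws.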
https://arxiv.org/abs/2404.15993
Large Language Models (LLMs), despite their impressive performance on a wide range of tasks, require significant GPU memory and consume substantial computational resources. In addition to model weights, the memory occupied by KV cache increases linearly with sequence length, becoming a main bottleneck for inference. In this paper, we introduce a novel approach for optimizing the KV cache which significantly reduces its memory footprint. Through a comprehensive investigation, we find that on LLaMA2 series models, (i) the similarity between adjacent tokens' query vectors is remarkably high, and (ii) current query's attention calculation can rely solely on the attention information of a small portion of the preceding queries. Based on these observations, we propose CORM, a KV cache eviction policy that dynamically retains important key-value pairs for inference without finetuning the model. We validate that CORM reduces the inference memory usage of KV cache by up to 70% without noticeable performance degradation across six tasks in LongBench.
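A toy version of a CORM-style eviction policy follows: keep only the key-value positions that received non-negligible attention from a small window of recent queries. The budget, window size, and max-over-queries scoring rule are illustrative choices, not the paper's exact algorithm.

```python
# Hedged sketch of attention-guided KV cache eviction.
import numpy as np

rng = np.random.default_rng(1)
T, d = 32, 8                        # cached positions, head dimension
K = rng.normal(size=(T, d))         # cached keys
recent_Q = rng.normal(size=(4, d))  # queries from the last few decode steps

scores = recent_Q @ K.T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

budget = 16                               # e.g. retain 50% of the cache
importance = attn.max(axis=0)             # peak recent attention per key
keep = np.argsort(importance)[-budget:]   # evict everything else

print(len(keep), "of", T, "positions retained")
```

The exploitable structure is observation (ii) in the abstract: if the current query attends much like its recent predecessors, keys they ignored can be dropped with little effect.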
https://arxiv.org/abs/2404.15949
A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial to building robust and reliable machine learning systems. Language model evaluation tasks lack informative metrics about model generalization, and applicability in a new setting is typically measured by task- and language-specific downstream performance, which is often unavailable for many languages and tasks. In this paper, we explore a set of efficient and reliable measures that could aid in computing more information related to the generalization capability of language models in cross-lingual zero-shot settings. In addition to traditional measures such as variance in parameters after training and distance from initialization, we also measure the effectiveness of sharpness of the loss landscape in capturing the success of cross-lingual transfer, and propose a novel and stable algorithm to reliably compute the sharpness of a model optimum that correlates with generalization.
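The quantity being measured can be illustrated with a simple probe: approximate the worst-case loss increase within an epsilon-ball around the optimum by sampling random perturbation directions. The paper's algorithm is more careful and stable than this; the toy quadratic loss and sampling scheme below are assumptions for illustration.

```python
# Hedged sketch: random-direction sharpness probe on a toy quadratic loss.
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([10.0, 1.0, 0.1])          # toy curvature: one sharp direction
loss = lambda w: 0.5 * w @ H @ w       # optimum at w = 0, with loss 0

def sharpness(w_opt, eps=0.1, n_dirs=200):
    """Max observed loss rise within an eps-ball of w_opt."""
    base = loss(w_opt)
    rises = []
    for _ in range(n_dirs):
        u = rng.normal(size=w_opt.shape)
        u *= eps / np.linalg.norm(u)
        rises.append(loss(w_opt + u) - base)
    return max(rises)

# For this loss the rise is bounded by 0.5 * eps^2 * lambda_max = 0.05:
print(sharpness(np.zeros(3)))
```

A flatter optimum (smaller rises) is the usual proxy for better generalization, which is the correlation the paper's algorithm aims to compute reliably.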
https://arxiv.org/abs/2404.15928
Social media users drive the spread of misinformation online by sharing posts that include erroneous information or commenting on controversial topics with unsubstantiated arguments often in earnest. Work on echo chambers has suggested that users' perspectives are reinforced through repeated interactions with like-minded peers, promoted by homophily and bias in information diffusion. Building on long-standing interest in the social bases of language and linguistic underpinnings of social behavior, this work explores how conversations around misinformation are mediated through language use. We compare a number of linguistic measures, e.g., in-/out-group cues, readability, and discourse connectives, within and across topics of conversation and user communities. Our findings reveal increased presence of group identity signals and processing fluency within echo chambers during discussions of misinformation. We discuss the specific character of these broader trends across topics and examine contextual influences.
https://arxiv.org/abs/2404.15925
This study explores the use of Large Language Models (LLMs) for automatic evaluation of knowledge graph (KG) completion models. Historically, validating information in KGs has been a challenging task, requiring large-scale human annotation at prohibitive cost. With the emergence of general-purpose generative AI and LLMs, it is now plausible that human-in-the-loop validation could be replaced by a generative agent. We introduce a framework for consistency and validation when using generative models to validate knowledge graphs. Our framework is based upon recent open-source developments for structural and semantic validation of LLM outputs, and upon flexible approaches to fact checking and verification, supported by the capacity to reference external knowledge sources of any kind. The design is easy to adapt and extend, and can be used to verify any kind of graph-structured data through a combination of model-intrinsic knowledge, user-supplied context, and agents capable of external knowledge retrieval.
https://arxiv.org/abs/2404.15923
Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model to use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best-performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches that of the original Med-PaLM, and it can often produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open-source mid-sized models on clinical tasks.
https://arxiv.org/abs/2404.15894
Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps. In this paper, we propose PMCTG to improve effectiveness by searching for the best edit position and action in each step. Specifically, PMCTG extends perturbed masking technique to effectively search for the most incongruent token to edit. Then it introduces four multi-aspect scoring functions to select edit action to further reduce search difficulty. Since PMCTG does not require supervised data, it could be applied to different generation tasks. We show that under the unsupervised setting, PMCTG achieves new state-of-the-art results in two representative tasks, namely keywords-to-sentence generation and paraphrasing.
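The perturbed-masking search step can be mimicked in miniature: remove each position in turn, rescore the sentence, and pick the position whose removal helps most, i.e. the most incongruent token. The real PMCTG queries a masked language model; the word-list scorer below is a stand-in heuristic for illustration only.

```python
# Hedged sketch of the "find the most incongruent token" search.
VOCAB = {"the", "cat", "sat", "on", "mat"}

def fluency(tokens):
    """Toy scorer: fraction of tokens the 'model' recognizes.
    A real implementation would use masked-LM probabilities instead."""
    return sum(t in VOCAB for t in tokens) / len(tokens)

def most_incongruent(tokens):
    base = fluency(tokens)
    # Removing a bad token raises the score the most.
    gains = [fluency(tokens[:i] + tokens[i + 1:]) - base
             for i in range(len(tokens))]
    return max(range(len(tokens)), key=gains.__getitem__)

sent = ["the", "cat", "zzqx", "sat"]
print(most_incongruent(sent))  # index 2 ("zzqx")
```

PMCTG then chooses among edit actions (e.g. replace, insert, delete) at the selected position using its multi-aspect scoring functions, rather than sampling the position stochastically.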
https://arxiv.org/abs/2404.15877
We present a novel approach to detecting noun abstraction within a large language model (LLM). Starting from a psychologically motivated set of noun pairs in taxonomic relationships, we instantiate surface patterns indicating hypernymy and analyze the attention matrices produced by BERT. We compare the results to two sets of counterfactuals and show that we can detect hypernymy in the abstraction mechanism, which cannot solely be related to the distributional similarity of noun pairs. Our findings are a first step towards the explainability of conceptual abstraction in LLMs.
https://arxiv.org/abs/2404.15848
It is imperative for large language models (LLMs) to follow instructions with elaborate requirements (i.e., complex instruction following). Yet it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge this gap, we first study what training data is effective in enhancing the ability to follow complex constraints. We find that training LLMs on instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels. The improvement can even generalize to compositions of out-of-domain constraints. Additionally, we propose methods for obtaining and utilizing this effective training data. Finally, we conduct extensive experiments to demonstrate the effectiveness of our methods in terms of overall performance, training efficiency, and generalization ability under four settings.
https://arxiv.org/abs/2404.15846
Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback remains low ultimately.
https://arxiv.org/abs/2404.15845
Knowledge Graph Completion (KGC) has garnered massive research interest recently, and most existing methods are designed following a transductive setting where all entities are observed during training. Despite the great progress on transductive KGC, these methods struggle to conduct reasoning on emerging KGs involving unseen entities. Thus, inductive KGC, which aims to deduce missing links among unseen entities, has become a new trend. Many existing studies transform inductive KGC into a graph classification problem by extracting enclosing subgraphs surrounding each candidate triple. Unfortunately, they still face certain challenges, such as the expensive time consumption caused by the repeated extraction of enclosing subgraphs, and the deficiency of entity-independent feature learning. To address these issues, we propose a global-local anchor representation (GLAR) learning method for inductive KGC. Unlike previous methods that utilize enclosing subgraphs, we extract a shared opening subgraph for all candidates and perform reasoning on it, enabling the model to perform reasoning more efficiently. Moreover, we design some transferable global and local anchors to learn rich entity-independent features for emerging entities. Finally, a global-local graph reasoning model is applied on the opening subgraph to rank all candidates. Extensive experiments show that our GLAR outperforms most existing state-of-the-art methods.
https://arxiv.org/abs/2404.15807
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.
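The draft-and-verify loop underlying speculative decoding can be sketched as follows: a cheap draft model proposes k tokens, the target model checks them in one pass, and the longest agreeing prefix (plus one corrected token) is accepted. Both "models" here are deterministic integer stand-ins and the acceptance rule is simplified; real systems accept or reject draft tokens probabilistically, and the paper's contribution is doing this across a batch of sequences.

```python
# Hedged sketch of a single speculative decoding step (greedy variant).
def draft_model(prefix, k):
    # Hypothetical cheap model: guesses the continuation but errs at step 2.
    return [prefix[-1] + 1 + i + (i == 2) for i in range(k)]

def target_model(prefix, k):
    # Hypothetical target model: the "true" continuation.
    return [prefix[-1] + 1 + i for i in range(k)]

def speculative_step(prefix, k=4):
    proposal = draft_model(prefix, k)
    truth = target_model(prefix, k)   # one target pass verifies all k drafts
    accepted = []
    for p, t in zip(proposal, truth):
        if p != t:
            accepted.append(t)        # keep the corrected token, stop here
            break
        accepted.append(p)
    return prefix + accepted

print(speculative_step([0]))  # [0, 1, 2, 3]: 3 tokens from one target pass
```

The latency win comes from amortizing one expensive target-model pass over several accepted tokens; batching makes the verification pass serve many sequences at once.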
https://arxiv.org/abs/2404.15778
Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and ethical deployment. This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare, emphasizing the critical need for empirical validation to fully exploit their capabilities in enhancing healthcare outcomes. Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by exploring the roles of LLMs in different medical applications, detailing how they are evaluated based on their performance in tasks such as clinical application, medical text data processing, information retrieval, data analysis, medical scientific writing, and educational content generation. The subsequent sections delve into the methodologies employed in these evaluations, discussing the benchmarks and metrics used to assess the models' effectiveness, accuracy, and ethical alignment. Through this survey, we aim to equip healthcare professionals, researchers, and policymakers with a comprehensive understanding of the potential strengths and limitations of LLMs in medical applications. By providing detailed insights into the evaluation processes and the challenges faced in integrating LLMs into healthcare, this survey seeks to guide the responsible development and deployment of these powerful models, ensuring they are harnessed to their full potential while maintaining stringent ethical standards.
https://arxiv.org/abs/2404.15777
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
https://arxiv.org/abs/2404.15758