The advent of Large Language Models (LLMs) provides new insights for validating Automated Driving Systems (ADS). This work presents a novel approach to extracting scenarios from naturalistic driving datasets. A framework called Chat2Scenario is proposed that leverages the advanced Natural Language Processing (NLP) capabilities of LLMs to understand and identify different driving scenarios. By inputting descriptive texts of driving conditions and specifying criticality metric thresholds, the framework efficiently searches for desired scenarios and converts them into ASAM OpenSCENARIO and IPG CarMaker text files. This methodology streamlines the scenario extraction process and enhances efficiency. Simulations are executed to validate the effectiveness of the approach. The framework is provided as a user-friendly web app and is accessible via the following link: this https URL.
大语言模型(LLM)的出现为验证自动驾驶系统(ADS)提供了新的思路。本文提出了一种从自然驾驶数据集中提取场景的新方法:一个名为Chat2Scenario的框架,利用LLM先进的自然语言处理(NLP)能力来理解和识别不同的驾驶场景。通过输入驾驶工况的描述文本并指定关键度指标阈值,该框架可以高效地搜索所需场景,并将其转换为ASAM OpenSCENARIO和IPG CarMaker文本文件。该方法简化了场景提取流程并提高了效率,其有效性已通过仿真验证。该框架以易用的网页应用程序形式提供,可通过以下链接访问:this https URL。
https://arxiv.org/abs/2404.16147
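The abstract above mentions searching scenarios against user-specified criticality metric thresholds. A minimal, hypothetical sketch of that filtering step, using time-to-collision (TTC) as the criticality metric; the scenario fields and the 3-second threshold are illustrative assumptions, not Chat2Scenario's actual schema:

```python
# Hypothetical sketch: filter candidate driving scenarios by a criticality
# metric threshold, in the spirit of Chat2Scenario's user-specified limits.
# Field names and the TTC metric choice are assumptions for illustration.

def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """TTC = longitudinal gap / closing speed; infinite if not closing."""
    if closing_speed_mps <= 0:
        return float("inf")
    return gap_m / closing_speed_mps

def filter_by_ttc(scenarios, max_ttc_s: float):
    """Keep scenarios whose TTC falls at or below the threshold."""
    selected = []
    for sc in scenarios:
        ttc = time_to_collision(sc["gap_m"], sc["closing_speed_mps"])
        if ttc <= max_ttc_s:
            selected.append({**sc, "ttc_s": ttc})
    return selected

scenarios = [
    {"id": "cut_in_1", "gap_m": 12.0, "closing_speed_mps": 6.0},  # TTC 2.0 s
    {"id": "follow_2", "gap_m": 40.0, "closing_speed_mps": 2.0},  # TTC 20.0 s
    {"id": "brake_3", "gap_m": 9.0, "closing_speed_mps": 0.0},    # not closing
]
critical = filter_by_ttc(scenarios, max_ttc_s=3.0)
print([sc["id"] for sc in critical])  # ['cut_in_1']
```

A real pipeline would compute the metric per frame from trajectory data before exporting matching scenarios to OpenSCENARIO.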
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.
人类反馈在大型语言模型(LLMs)的对齐中扮演着核心角色。然而,关于收集人类反馈的方法(如何)、领域(在哪里)、人群(谁)以及目标(为了什么目的),仍存在许多悬而未决的问题。为了回答这些问题,我们引入了PRISM,这是一个新数据集,它将来自75个国家的1,500名多样化参与者的社会人口统计特征和陈述偏好,与他们在与21个LLM进行的8,011次实时对话中的情境偏好和细粒度反馈关联起来。PRISM的贡献包括:(i)人类反馈数据中广泛的地理和人口统计参与;(ii)两个具有人口普查代表性的样本(英国和美国),用于理解集体福祉;(iii)个体化反馈,即每条评分都与详细的参与者资料相关联,从而可以探索个性化以及样本偏差的归因。我们重点收集围绕价值负载和有争议话题的主观与多元文化视角的对话,预计这些话题会出现最多的人际和跨文化分歧。我们通过对话多样性、偏好多样性和福祉结果三个案例研究展示了PRISM的用途,表明由哪些人来设定对齐规范至关重要。除了提供一个丰富的社区资源之外,我们还倡导更广泛地参与人工智能开发,以及更具包容性的技术设计方式。
https://arxiv.org/abs/2404.16019
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
大型视觉语言模型(LVLMs)在视觉对话和具身导航等通用多模态应用中取得了显著进展。然而,现有的多模态评估基准仅覆盖数量有限、只测试基础能力的多模态任务,难以跟踪LVLM的发展。在这项研究中,我们提出了MMT-Bench,这是一个全面的基准,旨在评估LVLMs在需要专家知识以及审慎的视觉识别、定位、推理和规划的大规模多模态任务上的能力。MMT-Bench包含来自车辆驾驶、具身导航等各种多模态场景的$31,325$个精心挑选的多选视觉问题,涵盖多模态理解中的$32$个核心元任务和$162$个子任务。得益于其广泛的任务覆盖,MMT-Bench支持使用任务图(task map)评估LVLMs,便于发现域内和域外任务。对$30$个LVLM(如专有的GPT-4V、GeminiProVision以及开源的InternVL-Chat)的评估结果表明,MMT-Bench带来了重大挑战。我们期待MMT-Bench能激励社区开发以实现通用多模态智能为目标的下一代多模态基础模型。
https://arxiv.org/abs/2404.16006
Concepts involved in long-form videos such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content even visual locations. Can this manner work for learning long-form video prior? Instead of operating on pixel space, it is efficient to employ visual locations like bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT. Due to the scarcity of suitable data, we create a new dataset called \textbf{Storyboard20K} from movies to serve as a representative. It includes synopses, shot-by-shot keyframes, and fine-grained annotations of film sets and characters with consistent IDs, bounding boxes, and whole body keypoints. In this way, long-form videos can be represented by a set of tokens and be learned via generative pre-training. Experimental results validate that our approach has great potential for learning long-form video prior. Code and data will be released at \url{this https URL}.
长视频中涉及的概念,如人物、物体及其交互,可以被视为遵循一种隐含的先验。它们非常复杂,至今仍难以被全面学习。近年来,生成式预训练(GPT)在建模任何类型的文本内容乃至视觉位置方面都展现出多样的能力。这种方式能否用于学习长视频先验呢?相比直接在像素空间上操作,采用边界框和关键点等视觉位置来表示视频中的关键信息更为高效,它们可以被简单地离散化,然后token化以供GPT使用。由于缺乏合适的数据,我们从电影中构建了一个名为\textbf{Storyboard20K}的新数据集作为代表。它包括剧情梗概、逐镜头关键帧,以及对电影场景和角色的细粒度标注(具有一致的ID、边界框和全身关键点)。这样,长视频就可以用一组token来表示,并通过生成式预训练来学习。实验结果验证了我们的方法在学习长视频先验方面具有巨大潜力。代码和数据将发布在\url{this https URL}。
https://arxiv.org/abs/2404.15909
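The abstract above proposes discretizing visual locations such as bounding boxes into tokens for generative pre-training. A minimal, hypothetical sketch of one such quantization scheme; the bin count and `<loc_N>` token format are assumptions for illustration, not the paper's actual tokenizer:

```python
# Illustrative sketch: quantize continuous box coordinates into integer bins
# and serialize them as location tokens a GPT-style model could consume.
# N_BINS and the token layout are assumptions, not the paper's scheme.

N_BINS = 256  # quantization resolution per coordinate (assumed)

def discretize(value: float, lo: float, hi: float, n_bins: int = N_BINS) -> int:
    """Map a continuous coordinate into one of n_bins integer bins."""
    value = min(max(value, lo), hi)  # clamp to the valid range
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)

def box_to_tokens(box, img_w: float, img_h: float):
    """Serialize (x1, y1, x2, y2) as coordinate tokens like '<loc_17>'."""
    x1, y1, x2, y2 = box
    bins = [
        discretize(x1, 0, img_w), discretize(y1, 0, img_h),
        discretize(x2, 0, img_w), discretize(y2, 0, img_h),
    ]
    return [f"<loc_{b}>" for b in bins]

tokens = box_to_tokens((64.0, 32.0, 192.0, 96.0), img_w=256.0, img_h=128.0)
print(tokens)  # ['<loc_64>', '<loc_64>', '<loc_192>', '<loc_192>']
```

Keypoints can be serialized the same way, one token pair per joint, so an entire shot becomes a flat token sequence.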
Large language models, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model they should use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches the original Med-PaLM, and it often can produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.
大语言模型,如GPT-4和Med-PaLM,在临床任务上表现出色;然而,它们需要大量计算资源、是闭源的,且无法在设备端部署。BioGPT-large、BioMedLM、LLaMA 2和Mistral 7B等中型模型避免了这些缺点,但它们在临床任务上的能力尚未得到充分研究。为了帮助评估它们的临床应用潜力,并帮助研究人员选择应使用的模型,我们比较了它们在两个临床问答(QA)任务上的表现:MedQA和消费者健康问答。我们发现Mistral 7B是表现最好的模型,它在所有基准测试中均领先,并超过了专门针对生物医学领域训练的模型。尽管Mistral 7B的MedQA得分(63.0%)接近最初的Med-PaLM,而且它通常能对消费者健康问题给出合理的回答,但仍有改进空间。这项研究首次对开源中型模型在临床任务上进行了正面对比评估。
https://arxiv.org/abs/2404.15894
In this preliminary study, we investigate a GPT-driven intent-based reasoning approach to streamline tool selection for large language models (LLMs) aimed at system efficiency. By identifying the intent behind user prompts at runtime, we narrow down the API toolset required for task execution, reducing token consumption by up to 24.6\%. Early results on a real-world, massively parallel Copilot platform with over 100 GPT-4-Turbo nodes show cost reductions and potential towards improving LLM-based system efficiency.
在这项初步研究中,我们研究了一种由GPT驱动的基于意图的推理方法,用于为大型语言模型(LLMs)简化工具选择,以提高系统效率。通过在运行时识别用户提示背后的意图,我们缩小了任务执行所需的API工具集范围,将token消耗最多减少了24.6%。在一个拥有超过100个GPT-4-Turbo节点的真实世界大规模并行Copilot平台上,早期结果显示了成本的降低,以及提升基于LLM的系统效率的潜力。
https://arxiv.org/abs/2404.15804
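The idea above is to classify the prompt's intent first and ship only the matching subset of tool descriptions, rather than the full API toolset, with each request. A minimal sketch under assumptions: the tool registry, intent labels, and keyword classifier below are all hypothetical stand-ins for the paper's GPT-driven intent reasoning:

```python
# Hypothetical sketch of intent-based tool narrowing: classify the prompt's
# intent, then include only matching tool groups, shrinking the token budget.
# Registry contents and the keyword classifier are illustrative assumptions.

TOOL_REGISTRY = {
    "calendar": ["create_event", "list_events", "delete_event"],
    "email": ["send_mail", "search_mail"],
    "files": ["read_file", "write_file", "list_dir"],
}

INTENT_KEYWORDS = {
    "calendar": ["meeting", "schedule", "event"],
    "email": ["mail", "inbox", "reply"],
    "files": ["file", "folder", "document"],
}

def select_tools(prompt: str):
    """Return only the tools from groups whose keywords match the prompt."""
    text = prompt.lower()
    intents = [name for name, kws in INTENT_KEYWORDS.items()
               if any(kw in text for kw in kws)]
    return [tool for name in intents for tool in TOOL_REGISTRY[name]]

full = sum(len(v) for v in TOOL_REGISTRY.values())
subset = select_tools("Schedule a meeting with the design team tomorrow")
print(subset, f"({len(subset)}/{full} tool descriptions sent)")
```

In the paper's setting the classifier is itself an LLM call, but the saving comes from the same place: fewer tool descriptions in the prompt means fewer tokens per request.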
Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and ethical deployment. This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare, emphasizing the critical need for empirical validation to fully exploit their capabilities in enhancing healthcare outcomes. Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by exploring the roles of LLMs in different medical applications, detailing how they are evaluated based on their performance in tasks such as clinical application, medical text data processing, information retrieval, data analysis, medical scientific writing, educational content generation etc. The subsequent sections delve into the methodologies employed in these evaluations, discussing the benchmarks and metrics used to assess the models' effectiveness, accuracy, and ethical alignment. Through this survey, we aim to equip healthcare professionals, researchers, and policymakers with a comprehensive understanding of the potential strengths and limitations of LLMs in medical applications. By providing detailed insights into the evaluation processes and the challenges faced in integrating LLMs into healthcare, this survey seeks to guide the responsible development and deployment of these powerful models, ensuring they are harnessed to their full potential while maintaining stringent ethical standards.
自2017年Transformer架构问世以来,GPT和BERT等大型语言模型(LLMs)显著演进,凭借其在语言理解和生成方面的先进能力影响了各个行业。这些模型展现出变革医疗领域的潜力,凸显了建立专门评估框架以确保其有效且合乎伦理地部署的必要性。这篇综述详细梳理了LLMs在医疗保健中的广泛应用及相应的评估需求,强调了通过实证验证来充分发挥其改善医疗结果能力的关键必要性。我们的综述对LLM在临床环境、医学文本数据处理、科研、教育和公共卫生宣传等方面的应用进行了深入分析。我们首先探讨LLMs在不同医疗应用中的角色,详细说明如何根据它们在临床应用、医学文本数据处理、信息检索、数据分析、医学科学写作、教育内容生成等任务中的表现对其进行评估。随后的章节深入讨论这些评估所采用的方法,包括用于衡量模型有效性、准确性和伦理对齐的基准与指标。通过这篇综述,我们旨在让医疗专业人员、研究人员和政策制定者全面了解LLMs在医疗应用中的潜在优势和局限。通过提供关于评估过程以及将LLMs整合到医疗保健中所面临挑战的详细见解,这篇综述力求指导这些强大模型的负责任开发与部署,确保它们在保持严格伦理标准的同时充分发挥全部潜力。
https://arxiv.org/abs/2404.15777
Generative pre-trained transformers (GPTs) are a type of large language machine learning model that is unusually adept at producing novel, and coherent, natural language. In this study the ability of GPT models to generate novel and correct versions, and notably very insecure versions, of implementations of the cryptographic hash function SHA-1 is examined. The GPT models Llama-2-70b-chat-h, Mistral-7B-Instruct-v0.1, and zephyr-7b-alpha are used. The GPT models are prompted to re-write each function using a modified version of the localGPT framework and langchain to provide word embedding context of the full source code and header files to the model, resulting in over 130,000 function re-write GPT output text blocks, approximately 40,000 of which were able to be parsed as C code and subsequently compiled. The generated code is analyzed for being compilable, correctness of the algorithm, memory leaks, compiler optimization stability, and character distance to the reference implementation. Remarkably, several generated function variants have a high implementation security risk of being correct for some test vectors, but incorrect for other test vectors. Additionally, many function implementations were not correct to the reference algorithm of SHA-1, but produced hashes that have some of the basic characteristics of hash functions. Many of the function re-writes contained serious flaws such as memory leaks, integer overflows, out of bounds accesses, use of uninitialised values, and compiler optimization instability. Compiler optimization settings and SHA-256 hash checksums of the compiled binaries are used to cluster implementations that are equivalent but may not have identical syntax; using this clustering, over 100,000 novel and correct versions of the SHA-1 codebase were generated where each component C function of the reference implementation is different from the original code.
生成式预训练变换器(GPT)是一类大型语言机器学习模型,尤其擅长生成新颖且连贯的自然语言。本研究考察了GPT模型生成加密哈希函数SHA-1实现的新颖且正确的版本(尤其是非常不安全的版本)的能力。所使用的GPT模型包括Llama-2-70b-chat-h、Mistral-7B-Instruct-v0.1和zephyr-7b-alpha。我们使用修改版的localGPT框架和langchain提示GPT模型重写每个函数,为模型提供完整源代码和头文件的词嵌入上下文,最终产生了超过130,000个函数重写的GPT输出文本块,其中约40,000个能够被解析为C代码并随后编译。我们对生成的代码进行了可编译性、算法正确性、内存泄漏、编译器优化稳定性以及与参考实现的字符距离等方面的分析。值得注意的是,若干生成的函数变体存在很高的实现安全风险:它们对某些测试向量是正确的,而对其他测试向量却是错误的。此外,许多函数实现并不符合SHA-1参考算法,但产生的哈希值具备哈希函数的一些基本特征。许多函数重写包含严重缺陷,如内存泄漏、整数溢出、越界访问、使用未初始化的值,以及编译器优化不稳定。我们利用编译器优化设置和已编译二进制文件的SHA-256哈希校验和,将等价但语法可能不同的实现聚类。借助这种聚类方法,我们生成了超过100,000个新颖且正确的SHA-1代码库版本,其中参考实现的每个组成C函数都与原始代码不同。
https://arxiv.org/abs/2404.15681
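The clustering step above groups rewrites by the SHA-256 checksum of their compiled binaries, so syntactically different sources that compile to identical machine code land in one cluster. A minimal sketch of that grouping; the byte strings below stand in for real compiled binaries:

```python
# Sketch: group function-rewrite variants by the SHA-256 digest of their
# compiled binary, so behaviorally identical rewrites cluster together.
# The sample "binaries" are illustrative stand-in byte strings.
import hashlib
from collections import defaultdict

def cluster_by_checksum(binaries):
    """Map SHA-256 hex digest -> list of variant names with that binary."""
    clusters = defaultdict(list)
    for name, blob in binaries.items():
        clusters[hashlib.sha256(blob).hexdigest()].append(name)
    return dict(clusters)

binaries = {
    "rewrite_a": b"\x55\x48\x89\xe5\x90",  # same code, different source syntax
    "rewrite_b": b"\x55\x48\x89\xe5\x90",
    "rewrite_c": b"\x55\x48\x89\xe5\xc3",  # genuinely different binary
}
clusters = cluster_by_checksum(binaries)
print([sorted(names) for names in clusters.values()])  # two clusters
```

In the study each compiler-optimization setting produces its own binary, so the (settings, checksum) pair, not the checksum alone, defines equivalence; the sketch omits that dimension.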
Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied if different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied if redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperforms Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed. We recommend future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.
系统综述(SR)是软件工程(SE)领域一种流行的研究方法。然而,完成一项SR平均需要67周。因此,将SR流程中的任何步骤自动化都可能减少与SR相关的工作量。我们的目标是研究大型语言模型(LLMs)能否通过为人工筛选者简化摘要来加速标题-摘要筛选,以及能否自动化标题-摘要筛选。我们进行了一项实验:人工筛选者对来自一项已有SR的20篇论文的标题和摘要(包括原始摘要和简化摘要)进行筛选。随后使用GPT-3.5和GPT-4两个LLM复现了同样的筛选任务。我们还研究了不同的提示技术(零样本(ZS)、单样本(OS)、少样本(FS)以及带思维链的少样本(FS-CoT))是否能提升LLM的筛选性能。最后,我们研究了重新设计LLM复现筛选时所用的提示能否带来性能提升。文本简化没有提高筛选者的筛选表现,但减少了筛选所用的时间。筛选者的科学素养和研究者身份可以预测筛选表现。一些LLM与提示的组合在筛选任务中的表现与人工筛选者相当。我们的结果表明,GPT-4优于其前代GPT-3.5。此外,少样本和单样本提示优于零样本提示。在筛选过程中使用LLM进行文本简化并没有显著提升人类的表现。使用LLM自动进行标题-摘要筛选看起来很有前景,但当前的LLM在准确性上并没有显著超过人工筛选者。要推荐在SR筛选过程中使用LLM,还需要更多研究。我们建议未来的SR研究发布包含筛选数据的复现包,以便对LLM筛选开展更具结论性的实验。
https://arxiv.org/abs/2404.15667
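The prompting variants compared above (ZS, OS, FS) differ only in how many labeled examples precede the new item. A minimal sketch of that prompt assembly; the instruction wording and example papers are invented for illustration:

```python
# Sketch: zero-/one-/few-shot screening prompts differ only in the number
# of worked examples included. Instruction text and examples are invented.

INSTRUCTION = ("Decide whether the paper below should be INCLUDED or "
               "EXCLUDED from the review based on its title and abstract.")

EXAMPLES = [
    ("Title: LLMs for code review\nAbstract: ...", "INCLUDED"),
    ("Title: Soil moisture sensing\nAbstract: ...", "EXCLUDED"),
]

def build_prompt(item: str, n_shots: int) -> str:
    """n_shots=0 -> zero-shot, 1 -> one-shot, len(EXAMPLES) -> few-shot."""
    parts = [INSTRUCTION]
    for text, label in EXAMPLES[:n_shots]:
        parts.append(f"{text}\nAnswer: {label}")
    parts.append(f"{item}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt("Title: Screening automation with GPT-4\nAbstract: ...",
                      n_shots=1)
print(prompt.count("Answer:"))  # 2: one worked example plus the query
```

FS-CoT would additionally append a short reasoning trace to each example's answer before the label.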
Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced the performance by uncovering complex patterns and relationships within large datasets. However, a major challenge arises as users often do not need all the clusterings that algorithms generate, and figuring out the one needed requires a substantial understanding of each clustering result. Traditionally, aligning a user's brief keyword of interest with the corresponding vision components was challenging, but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response, given unlabeled target visual data, we propose Multi-MaP, a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings, with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover, reference word constraint and concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-MaP not only adeptly captures a user's interest via a keyword but also facilitates identifying relevant clusterings. Our extensive experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks. Our code is available at this https URL.
近年来,多重聚类因能从不同视角揭示数据的多种隐藏结构而受到广泛关注。深度多重聚类技术的出现,通过发掘大规模数据集中的复杂模式和关系,显著提升了性能。然而,一个主要挑战在于,用户通常并不需要算法生成的所有聚类,而找出所需的那一个需要对每个聚类结果有相当深入的理解。传统上,将用户简短的兴趣关键词与相应的视觉成分对齐颇具挑战,但多模态和大型语言模型(LLMs)的出现已开始弥合这一差距。为此,针对无标签的目标视觉数据,我们提出了Multi-MaP,一种采用多模态代理学习过程的新方法。它利用CLIP编码器提取一致的文本和图像嵌入,并通过GPT-4整合用户兴趣以构建有效的文本上下文。此外,我们设计了参考词约束和概念级约束,以根据用户兴趣学习最优的文本代理。Multi-MaP不仅能通过关键词巧妙地捕获用户兴趣,还便于识别相关的聚类。大量实验表明,Multi-MaP在所有基准多重聚类视觉任务中都稳定优于最先进的方法。我们的代码可在以下链接获取:this https URL。
https://arxiv.org/abs/2404.15655
General purpose Large Language Models (LLM) such as the Generative Pretrained Transformer (GPT) and Large Language Model Meta AI (LLaMA) have attracted much attention in recent years. There is strong evidence that these models can perform remarkably well in various natural language processing tasks. However, how to leverage them to approach domain-specific use cases and drive value remains an open question. In this work, we focus on a specific use case, pharmaceutical manufacturing investigations, and propose that leveraging historical records of manufacturing incidents and deviations in an organization can be beneficial for addressing and closing new cases, or de-risking new manufacturing campaigns. Using a small but diverse dataset of real manufacturing deviations selected from different product lines, we evaluate and quantify the power of three general purpose LLMs (GPT-3.5, GPT-4, and Claude-2) in performing tasks related to the above goal. In particular, (1) the ability of LLMs in automating the process of extracting specific information such as root cause of a case from unstructured data, as well as (2) the possibility of identifying similar or related deviations by performing semantic search on the database of historical records are examined. While our results point to the high accuracy of GPT-4 and Claude-2 in the information extraction task, we discuss cases of complex interplay between the apparent reasoning and hallucination behavior of LLMs as a risk factor. Furthermore, we show that semantic search on vector embedding of deviation descriptions can be used to identify similar records, such as those with a similar type of defect, with a high level of accuracy. We discuss further improvements to enhance the accuracy of similar record identification.
近年来,生成式预训练Transformer(GPT)和Large Language Model Meta AI(LLaMA)等通用大型语言模型(LLM)受到广泛关注。有充分证据表明,这些模型在各种自然语言处理任务中表现出色。然而,如何利用它们处理领域特定的用例并创造价值仍是一个悬而未决的问题。在这项工作中,我们关注一个具体用例,即药品制造调查,并提出利用组织内制造事件和偏差的历史记录,有助于处理和关闭新案例,或降低新生产活动的风险。我们使用一个从不同产品线中挑选的、规模虽小但多样的真实制造偏差数据集,评估并量化了三种通用LLM(GPT-3.5、GPT-4和Claude-2)执行上述目标相关任务的能力。具体而言,我们考察了:(1)LLM从非结构化数据中自动提取特定信息(如案例的根本原因)的能力;以及(2)通过对历史记录数据库进行语义搜索来识别相似或相关偏差的可能性。虽然结果显示GPT-4和Claude-2在信息提取任务中准确率很高,但我们也讨论了LLM表面推理与幻觉行为之间复杂相互作用的案例,并将其视为一种风险因素。此外,我们还展示了对偏差描述的向量嵌入进行语义搜索,可以高准确率地识别相似记录,例如具有相似缺陷类型的记录。最后,我们讨论了进一步提升相似记录识别准确率的改进方向。
https://arxiv.org/abs/2404.15578
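The semantic-search step above retrieves historical deviation records whose embedding vectors are closest to the query's. A minimal sketch of nearest-neighbor retrieval by cosine similarity; the toy 3-d vectors stand in for real embedding-model outputs, and the record IDs are invented:

```python
# Sketch: semantic search over deviation records via cosine similarity on
# embedding vectors. Vectors and record names are illustrative toys; a real
# system would embed the descriptions with an embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, records, k=2):
    """Return the ids of the k records most similar to the query embedding."""
    scored = sorted(records, key=lambda r: cosine(query_vec, r["vec"]),
                    reverse=True)
    return [r["id"] for r in scored[:k]]

records = [
    {"id": "DEV-101 particulate defect", "vec": [0.9, 0.1, 0.0]},
    {"id": "DEV-102 label misprint",     "vec": [0.0, 0.2, 0.9]},
    {"id": "DEV-103 visible particles",  "vec": [0.7, 0.4, 0.1]},
]
query = [0.85, 0.2, 0.05]  # embedding of a new particulate-type deviation
print(top_k(query, records))
```

At production scale the linear scan would be replaced by an approximate nearest-neighbor index, but the ranking criterion is the same.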
Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first, end-to-end large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.
临床试验匹配的任务是为患者识别其可能符合条件的试验。通常,这项任务劳动密集,需要将患者的电子健康档案(EHR)与临床试验严格的纳入和排除标准进行详细核对。这一过程依赖人工、耗时,且难以规模化,导致许多患者错失潜在的治疗选择。正如多项同期研究所展示的,大型语言模型(LLMs)的最新进展使自动化患者-试验匹配成为可能。然而,现有方法局限于受限的、通常是合成的数据集,无法充分反映真实世界医疗数据的复杂性。在本研究中,我们首次基于真实世界EHR对临床试验匹配进行了端到端的大规模实证评估。我们的研究展示了LLM能够准确地为患者匹配合适的临床试验。我们使用专有LLM(包括GPT-4和GPT-3.5)以及我们自行微调的模型OncoLLM进行了实验,结果表明,尽管OncoLLM的规模小得多,它不仅超过了GPT-3.5,而且达到了与合格医生相当的水平。所有实验均在真实世界EHR上进行,数据包括来自美国一家癌症中心的临床笔记和可用的临床试验。
https://arxiv.org/abs/2404.15549
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at this https URL.
近年来发展的大型语言模型(LLMs)在广泛的语言理解任务上表现出色。但是,它们真的能对自然语言进行"推理"吗?这个问题已受到大量研究关注,常识推理、数值推理和定性推理等多种推理技能都已被研究。然而,与"逻辑推理"相关的关键技能仍未得到充分探索。现有研究LLM这一推理能力的工作,仅关注了命题逻辑和一阶逻辑的少数几条推理规则(如肯定前件(modus ponens)和否定后件(modus tollens))。为弥补上述局限,我们在横跨命题逻辑、一阶逻辑和非单调逻辑的25种不同推理模式上,全面评估了LLMs的逻辑推理能力。为了进行系统性评估,我们引入了LogicBench,一个聚焦于单条推理规则使用的自然语言问答数据集。我们使用思维链提示,对GPT-4、ChatGPT、Gemini、Llama-2和Mistral等一系列LLM进行了详细分析。实验结果表明,现有LLM在LogicBench上表现不佳;尤其是在涉及复杂推理和否定的实例上表现挣扎。此外,它们有时会忽视得出正确结论所需的上下文信息。我们相信这项工作及其发现将为未来评估和提升LLM逻辑推理能力的研究提供帮助。数据和代码可在以下链接获取:this https URL。
https://arxiv.org/abs/2404.15522
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks and across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
地理空间Copilot通过自然语言指令为执行地球观测(EO)应用释放了前所未有的潜力。然而,现有的智能体依赖于过度简化的单一任务和基于模板的提示,与真实世界场景脱节。在这项工作中,我们提出了GeoLLM-Engine,一个面向工具增强智能体的环境,其中包含分析师在遥感平台上日常执行的复杂任务。我们为该环境配备了地理空间API工具、动态地图/UI以及外部多模态知识库,以便恰当地衡量智能体解释真实的高层自然语言命令的熟练程度及其完成任务的功能正确性。通过减轻通常与人在回路基准构建相关的开销,我们在100个GPT-4-Turbo节点上运行大规模并行引擎,扩展到超过50万个多样化的多工具任务和110万张卫星图像。通过超越传统的单任务图像描述范式,我们针对长时程提示考察了最先进的智能体和提示技术。
https://arxiv.org/abs/2404.15500
Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological elements like urgency, fear, social proof, and other manipulative strategies, phishers can lure individuals into revealing sensitive and personalized information. Building on this pervasive issue within modern technology, this paper aims to analyze the effectiveness of 15 Large Language Models (LLMs) in detecting phishing attempts, specifically focusing on a randomized set of "419 Scam" emails. The objective is to determine which LLMs can accurately detect phishing emails by analyzing a text file containing email metadata based on predefined criteria. The experiment concluded that the following models, ChatGPT 3.5, GPT-3.5-Turbo-Instruct, and ChatGPT, were the most effective in detecting phishing emails.
网络钓鱼作为几十年来盛行的网络犯罪手段,在当今数字世界中仍是重大威胁。网络犯罪分子利用巧妙的社会工程手段和现代技术,以众多个人、企业和组织为目标,利用其信任与安全漏洞。这些网络攻击者常常伪装成多种可信的形式,冒充合法来源。通过巧妙运用紧迫感、恐惧、社会认同等心理要素及其他操纵策略,网络钓鱼者能够诱使个人泄露敏感的个人信息。基于这一在现代技术中普遍存在的问题,本文旨在分析15个大型语言模型(LLMs)检测网络钓鱼企图的有效性,特别聚焦于一组随机抽取的"419诈骗"电子邮件。目标是通过基于预定义标准分析包含邮件元数据的文本文件,确定哪些LLM能够准确检测出钓鱼邮件。实验结论是,ChatGPT 3.5、GPT-3.5-Turbo-Instruct和ChatGPT这几个模型在检测钓鱼邮件方面最为有效。
https://arxiv.org/abs/2404.15485
This workshop paper presents a critical examination of the integration of Generative AI (Gen AI) into the academic writing process, focusing on the use of AI as a collaborative tool. It contrasts the performance and interaction of two AI models, Gemini and ChatGPT, through a collaborative inquiry approach where researchers engage in facilitated sessions to design prompts that elicit specific AI responses for crafting research outlines. This case study highlights the importance of prompt design, output analysis, and recognizing the AI's limitations to ensure responsible and effective AI integration in scholarly work. Preliminary findings suggest that prompt variation significantly affects output quality and reveals distinct capabilities and constraints of each model. The paper contributes to the field of Human-Computer Interaction by exploring effective prompt strategies and providing a comparative analysis of Gen AI models, ultimately aiming to enhance AI-assisted academic writing and prompt a deeper dialogue within the HCI community.
这篇研讨会论文对将生成式人工智能(Gen AI)融入学术写作过程进行了批判性审视,重点关注将AI用作协作工具。论文通过协作探究的方式对比了Gemini和ChatGPT两个AI模型的表现与交互:研究人员参加引导式会议,设计能引出特定AI回应的提示,用于撰写研究大纲。该案例研究强调了提示设计、输出分析以及认识AI局限性的重要性,以确保AI以负责且有效的方式融入学术工作。初步发现表明,提示的变化会显著影响输出质量,并揭示出每个模型各自的能力与限制。本文通过探索有效的提示策略并对Gen AI模型进行比较分析,为人机交互(HCI)领域做出了贡献,最终旨在增强AI辅助的学术写作,并在HCI社区内引发更深入的对话。
https://arxiv.org/abs/2404.16071
Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
多模态LLM是LLM的自然演进,它扩展了LLM的能力,使其可以超越纯文本模态工作。在设计新颖架构和视觉-语言适配器的研究不断推进之际,本文专注于赋予此类模型回答需要外部知识的问题的能力。我们的方法名为Wiki-LLaVA,旨在整合一个由多模态文档构成的外部知识源,并通过分层检索管道进行访问。借助这种方法,相关段落从外部知识源中被检索出来,作为LLM的额外上下文,从而提升所生成对话的有效性和准确性。我们在专为需要外部数据的视觉问答而设计的数据集上进行了大量实验,证明了我们方法的适用性。
https://arxiv.org/abs/2404.15406
Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We developed three layout reasoning tasks to train the model in understanding and executing layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses the performance of few-shot GPT-4V models, with mIoU higher by 12% on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually-rich documents.
近年来,指令跟随模型的进步使用户与模型的交互更加友好、高效,拓宽了其应用范围。在图形设计中,非专业用户常因技能和资源有限而难以创建视觉上吸引人的版式。在这项工作中,我们提出了一个新颖的多模态指令跟随版式规划框架,允许用户通过指定画布尺寸和设计用途(如书籍封面、海报、宣传册或菜单),轻松地将视觉元素排布成定制的版式。我们设计了三个版式推理任务来训练模型理解并执行版式指令。在两个基准上的实验表明,我们的方法不仅为非专业用户简化了设计流程,还超过了少样本GPT-4V模型的表现,在Crello上的mIoU高出12%。这一进展凸显了多模态指令跟随模型在自动化和简化设计过程方面的潜力,为视觉内容丰富的文档上的各类设计任务提供了一种易用的解决方案。
https://arxiv.org/abs/2404.15271
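The benchmark above reports mean IoU (mIoU) between predicted and ground-truth layout element boxes. A minimal sketch of that metric for axis-aligned boxes given as (x1, y1, x2, y2); the sample boxes and the one-to-one pairing are illustrative assumptions:

```python
# Sketch: mean intersection-over-union for paired layout boxes.
# Boxes are (x1, y1, x2, y2); the sample values are illustrative.

def iou(a, b):
    """Intersection-over-union of two axis-aligned rectangles."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over paired predicted and ground-truth boxes."""
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)

preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
gts   = [(0, 0, 10, 10), (25, 20, 35, 30)]
print(round(mean_iou(preds, gts), 3))  # 0.667: perfect match + 1/3 overlap
```

A 12% absolute gain in this metric therefore means predicted element boxes overlap the ground truth noticeably more tightly on average.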
We study interactive learning of language agents based on user edits made to the agent's output. In a typical setting such as writing assistants, the user interacts with a language agent to generate a response given a context, and may optionally edit the agent response to personalize it based on their latent preference, in addition to improving the correctness. The edit feedback is naturally generated, making it a suitable candidate for improving the agent's alignment with the user's preference, and for reducing the cost of user edits over time. We propose a learning framework, PRELUDE that infers a description of the user's latent preference based on historic edit data and using it to define a prompt policy that drives future response generation. This avoids fine-tuning the agent, which is costly, challenging to scale with the number of users, and may even degrade its performance on other tasks. Furthermore, learning descriptive preference improves interpretability, allowing the user to view and modify the learned preference. However, user preference can be complex and vary based on context, making it challenging to learn. To address this, we propose a simple yet effective algorithm named CIPHER that leverages a large language model (LLM) to infer the user preference for a given context based on user edits. In the future, CIPHER retrieves inferred preferences from the k-closest contexts in the history, and forms an aggregate preference for response generation. We introduce two interactive environments -- summarization and email writing, for evaluation using a GPT-4 simulated user. We compare with algorithms that directly retrieve user edits but do not learn descriptive preference, and algorithms that learn context-agnostic preference. On both tasks, CIPHER achieves the lowest edit distance cost and learns preferences that show significant similarity to the ground truth preferences
我们研究基于用户对语言智能体输出所做编辑的交互式学习。在写作助手等典型场景中,用户与语言智能体交互,让其根据给定上下文生成回复,并且除了改进正确性之外,还可以选择性地编辑该回复,以按照自己的潜在偏好进行个性化。这种编辑反馈是自然产生的,因此很适合用来改进智能体与用户偏好的对齐,并随着时间推移降低用户编辑的成本。我们提出了一个学习框架PRELUDE,它基于历史编辑数据推断对用户潜在偏好的描述,并据此定义驱动后续回复生成的提示策略。这避免了对智能体进行微调,因为微调成本高昂、难以随用户数量扩展,甚至可能降低其在其他任务上的表现。此外,学习描述性的偏好提高了可解释性,使用户可以查看并修改学到的偏好。然而,用户偏好可能很复杂并随上下文变化,这给学习带来了挑战。为此,我们提出了一个简单而有效的算法CIPHER,它利用大型语言模型(LLM)基于用户编辑来推断给定上下文下的用户偏好。在后续交互中,CIPHER从历史中检索与当前上下文最接近的k个上下文所对应的已推断偏好,并将其聚合为用于回复生成的整体偏好。我们引入了两个交互式环境(摘要写作和电子邮件写作),并使用GPT-4模拟的用户进行评估。我们与两类算法进行了比较:直接检索用户编辑但不学习描述性偏好的算法,以及学习与上下文无关偏好的算法。在两个任务上,CIPHER都取得了最低的编辑距离成本,并且学到的偏好与真实偏好表现出显著的相似性。
https://arxiv.org/abs/2404.15269
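The retrieval step described for CIPHER can be pictured schematically: past contexts carry preferences inferred from user edits, and for a new context the k nearest historical contexts are retrieved and their preferences aggregated. The sketch below is an assumption-laden toy: the 2-d "embeddings", preference strings, and simple string-join aggregation stand in for LLM-inferred preferences and prompting:

```python
# Schematic sketch of CIPHER-style preference retrieval: find the k closest
# past contexts and aggregate their inferred preferences. Embeddings here
# are toy 2-d vectors; a real system would infer preferences with an LLM.
import math

def dist(a, b):
    """Euclidean distance between two context embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

history = [
    {"ctx_vec": [1.0, 0.0], "preference": "terse, bullet points"},
    {"ctx_vec": [0.9, 0.2], "preference": "no jargon"},
    {"ctx_vec": [0.0, 1.0], "preference": "formal salutation"},
]

def aggregate_preference(new_ctx_vec, history, k=2):
    """Join the preferences of the k closest past contexts."""
    nearest = sorted(history, key=lambda h: dist(new_ctx_vec, h["ctx_vec"]))[:k]
    return "; ".join(h["preference"] for h in nearest)

pref = aggregate_preference([0.95, 0.05], history, k=2)
print(pref)  # terse, bullet points; no jargon
```

The aggregated string would then be prepended to the generation prompt, steering the next response toward the user's inferred style without any fine-tuning.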
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize unlabelled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised. Using only (1) a well-defined API schema (2) a set of unlabelled dialogues between a user and agent, we develop a novel approach for inferring turn-level annotations as latent variables using a noisy channel model. We iteratively improve these pseudo-labels with expectation-maximization (EM), and use the inferred labels to train an end-to-end dialogue agent. Evaluating our approach on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
训练面向任务的对话系统通常需要轮次级标注来与其API交互:例如每一步的对话状态和系统动作。这些标注制作成本高、容易出错,且同时需要领域知识和标注经验。随着LLM的进步,我们假设只需无标签数据和一个模式(schema)定义,就足以在完全无监督的情况下构建一个可用的面向任务的对话系统。仅使用(1)定义良好的API模式和(2)一组用户与坐席之间的无标签对话,我们提出了一种新方法,利用噪声信道模型将轮次级标注作为潜变量进行推断。我们通过期望最大化(EM)迭代改进这些伪标签,并使用推断出的标签训练端到端对话智能体。在MultiWOZ基准上的评估显示,我们的方法使强大的GPT-3.5基线的对话成功率提高了一倍以上。
https://arxiv.org/abs/2404.15219