In this paper, we explore the generation of one-liner jokes through multi-step reasoning. Our work involved reconstructing the process behind creating humorous one-liners and developing a working prototype for humor generation. We conducted comprehensive experiments with human participants to evaluate our approach, comparing it with human-created jokes, zero-shot GPT-4 generated humor, and other baselines. The evaluation focused on the quality of humor produced, using human labeling as a benchmark. Our findings demonstrate that the multi-step reasoning approach consistently improves the quality of generated humor. We present the results and share the datasets used in our experiments, offering insights into enhancing humor generation with artificial intelligence.
https://arxiv.org/abs/2405.07280
Large language models have seen extraordinary growth in popularity due to their human-like content generation capabilities. We show that these models can also be used to successfully cluster human-generated content, with success defined through the measures of distinctiveness and interpretability. This success is validated by both human reviewers and ChatGPT, providing an automated means to close the 'validation gap' that has challenged short-text clustering. Comparing the machine and human approaches we identify the biases inherent in each, and question the reliance on human-coding as the 'gold standard'. We apply our methodology to Twitter bios and find characteristic ways humans describe themselves, agreeing well with prior specialist work, but with interesting differences characteristic of the medium used to express identity.
https://arxiv.org/abs/2405.07278
The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties similar to human norms, whereas the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.
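As a concrete illustration of the psychometric properties at issue, below is a minimal sketch computing Cronbach's alpha, a standard internal-consistency statistic, over a simulated matrix of persona responses. The data and the `cronbach_alpha` helper are hypothetical; the paper's actual measures and scoring procedures are not reproduced here.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) response matrix."""
    n_items = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)      # variance of each item
    total_variance = responses.sum(axis=1).var(ddof=1)  # variance of the scale score
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Toy example: 5 simulated "personas" answering a 4-item Likert scale (1-5).
rng = np.random.default_rng(0)
latent = rng.normal(size=(5, 1))  # shared latent trait drives all items
responses = np.clip(np.round(3 + latent + rng.normal(scale=0.5, size=(5, 4))), 1, 5)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```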
https://arxiv.org/abs/2405.07248
Educational scholars have analyzed various image data acquired from teaching and learning situations, such as photos that show classroom dynamics, students' drawings about the learning content, textbook illustrations, etc. Unquestionably, most qualitative analysis and explanation of image data has been conducted by human researchers, without machine-based automation. This is partly because most image-processing artificial intelligence models were neither accessible to general educational scholars nor explainable, owing to their complex deep neural network architectures. However, the recent development of Visual Question Answering (VQA) techniques has produced usable visual language models, which receive from the user a question about a given image and return an answer, both in natural language. In particular, GPT-4V, released by OpenAI, has opened up state-of-the-art visual language model services so that VQA can be used for a variety of purposes. However, VQA and GPT-4V have not yet been widely applied to educational studies. In this position paper, we suggest that GPT-4V contributes to realizing VQA for education. By 'realizing' VQA, we denote two meanings: (1) GPT-4V realizes the utilization of VQA techniques by any educational scholar without technical or accessibility barriers, and (2) GPT-4V makes educational scholars realize the usefulness of VQA to educational research. Given this, the paper introduces VQA for educational studies, providing a milestone for educational research methodology. Chapter II reviews the development of VQA techniques, culminating in the release of GPT-4V. Chapter III reviews the use of image analysis in educational studies. Chapter IV demonstrates how GPT-4V can be used for each research usage reviewed in Chapter III, with operating prompts provided. Finally, Chapter V discusses the future implications.
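To make the "no technical barrier" point concrete, here is a minimal sketch of a VQA call using the openai Python SDK (v1 interface) with a vision-capable model. The model name reflects what was available at the time of writing, and the image URL and question are placeholders rather than the paper's operating prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_about_image(image_url: str, question: str) -> str:
    """Send one VQA-style (image, question) pair to a GPT-4 vision model."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # vision model name at the time of writing
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ask_about_image(
    "https://example.com/classroom_photo.jpg",  # hypothetical image URL
    "How many students in this photo are raising their hands?",
))
```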
https://arxiv.org/abs/2405.07163
Learning reward functions remains the bottleneck to equipping a robot with a broad repertoire of skills. Large Language Models (LLMs) contain valuable task-related knowledge that can potentially aid in the learning of reward functions. However, the proposed reward function can be imprecise and thus ineffective, requiring further grounding with environment information. We propose a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learnt reward functions based on execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency, while consuming significantly fewer GPT tokens than the alternative mutation-based method.
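A minimal sketch of the self-alignment idea follows, assuming a linear reward over LLM-proposed features and a hinge-style update on pairs that the learnt reward ranks inconsistently with the LLM; the paper's actual parameterization and update rule may differ.

```python
import itertools
import numpy as np

def reward(features: np.ndarray, w: np.ndarray) -> float:
    return float(features @ w)  # assumed linear form over LLM-proposed features

def self_align(rollouts, llm_prefers, w, lr=0.1, iters=50):
    """Update reward weights w until their ranking of executed rollouts
    agrees with the LLM's pairwise judgments (queried once per pair)."""
    prefs = [(i, j) if llm_prefers(i, j) else (j, i)
             for i, j in itertools.combinations(range(len(rollouts)), 2)]
    for _ in range(iters):
        for better, worse in prefs:
            margin = reward(rollouts[better], w) - reward(rollouts[worse], w)
            if margin < 1.0:  # inconsistent (or weakly separated) pair
                w = w + lr * (rollouts[better] - rollouts[worse])
    return w

# Toy usage: two rollouts; the stand-in "LLM" always prefers the first one.
rollouts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w = self_align(rollouts, llm_prefers=lambda i, j: i < j, w=np.zeros(2))
print(w)  # weights now rank rollouts[0] above rollouts[1]
```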
https://arxiv.org/abs/2405.07162
Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of two well-known post-training techniques, SmoothQuant and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of both techniques by enabling quantization to microscaling (MX) formats, expanding their applicability beyond their initial fixed-point format targets. We show that by applying GPTQ and SmoothQuant, and employing MX formats for quantizing models, we can achieve a significant reduction in the size of OPT models by up to 4x and LLaMA models by up to 3x with a negligible perplexity increase of 1-3%.
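As a toy sketch of what quantizing to a microscaling (MX) format involves, the snippet below gives each small block of values one shared power-of-two scale and stores elements in a low-bit integer type. Real MX formats fix the block size and element encodings, and this sketch implements neither SmoothQuant nor GPTQ themselves.

```python
import numpy as np

def mx_quantize(x: np.ndarray, block: int = 32, elem_bits: int = 8):
    """Toy MX-style quantizer: each block of `block` values shares one
    power-of-two scale; elements are stored as signed `elem_bits`-bit ints."""
    x = x.reshape(-1, block)
    qmax = 2 ** (elem_bits - 1) - 1
    # Smallest power-of-two scale that brings each block into integer range.
    scales = 2.0 ** np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) / qmax + 1e-12))
    q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def mx_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(128).astype(np.float32)
q, s = mx_quantize(w)
print(f"mean abs quantization error: {np.abs(mx_dequantize(q, s).ravel() - w).mean():.5f}")
```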
https://arxiv.org/abs/2405.07135
Social robotics researchers are increasingly interested in multi-party trained conversational agents. With a growing demand for real-world evaluations, our study presents Large Language Models (LLMs) deployed in a month-long live show at the Edinburgh Festival Fringe. This case study investigates human improvisers co-creating with conversational agents in a professional theatre setting. We explore the technical capabilities and constraints of on-the-spot multi-party dialogue, providing comprehensive insights from both audience and performer experiences with AI on stage. Our human-in-the-loop methodology underlines the challenges these LLMs face in generating context-relevant responses, stressing the user interface's crucial role. Audience feedback indicates an evolving interest in AI-driven live entertainment and direct human-AI interaction, along with a diverse range of expectations about AI's conversational competence and utility as a creativity support tool. Human performers express immense enthusiasm but varied satisfaction, and evolving public opinion highlights mixed emotions about AI's role in the arts.
https://arxiv.org/abs/2405.07111
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to make the text generation model sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Unlike the conventional way of training these tokens with training data, we update these tokens with pseudo-targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground-truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
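The mechanics can be sketched with stand-ins: a handful of learnable soft-token embeddings are the only trainable parameters, updated for roughly 16 test-time steps against losses defined by frozen models. Everything below (the random projections standing in for XCLIP/CLIP/GPT-2 and the particular loss) is illustrative, not the paper's actual models or objectives.

```python
import torch

torch.manual_seed(0)
# Frozen stand-ins for the paper's models (XCLIP/CLIP encoders, GPT-2).
video_feat = torch.randn(512)                           # "XCLIP" video embedding
scorer = torch.nn.Linear(512, 1).requires_grad_(False)  # frozen "CLIP-like" head

# The communication medium: a few learnable tokens fed to the frozen LM.
soft_tokens = torch.nn.Parameter(0.02 * torch.randn(4, 512))
opt = torch.optim.Adam([soft_tokens], lr=0.1)

for step in range(16):  # the paper reports needing only ~16 iterations
    opt.zero_grad()
    pooled = soft_tokens.mean(dim=0)
    # Illustrative pseudo-target loss: align the tokens with the video
    # embedding while regularizing the frozen scorer's output.
    loss = (1 - torch.cosine_similarity(pooled, video_feat, dim=0)) \
        + 0.1 * scorer(pooled).abs().squeeze()
    loss.backward()
    opt.step()

print(f"loss after adaptation: {loss.item():.3f}")
```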
https://arxiv.org/abs/2405.07046
As the conversation around using geoengineering to combat climate change intensifies, it is imperative to engage the public and deeply understand their perspectives on geoengineering research, development, and potential deployment. Through a comprehensive data-driven investigation, this paper explores the types of news that captivate public interest in geoengineering. We delved into 30,773 English-language news articles from the BBC and the New York Times, combined with Google Trends data spanning 2018 to 2022, to explore how public interest in geoengineering fluctuates in response to news coverage of broader climate issues. Using BERT-based topic modeling, sentiment analysis, and time-series regression models, we found that positive sentiment in energy-related news serves as a good predictor of heightened public interest in geoengineering, a trend that persists over time. Our findings suggest that public engagement with geoengineering and climate action is not uniform, with some topics being more potent in shaping interest over time, such as climate news related to energy, disasters, and politics. Understanding these patterns is crucial for scientists, policymakers, and educators aiming to craft effective strategies for engaging with the public and fostering dialogue around emerging climate technologies.
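For a flavor of the modeling step, here is a minimal sketch regressing public interest on lagged news sentiment with statsmodels; the series are simulated, and the paper's actual specification (controls, lag structure, topic covariates) is richer.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated monthly series: Google-Trends-style interest in geoengineering
# and mean sentiment of energy-related climate news (both standardized).
rng = np.random.default_rng(42)
n = 60
sentiment = pd.Series(rng.normal(size=n))
interest = 0.5 * sentiment.shift(1) + rng.normal(scale=0.8, size=n)

df = pd.DataFrame({"interest": interest, "sent_lag1": sentiment.shift(1)}).dropna()
model = sm.OLS(df["interest"], sm.add_constant(df["sent_lag1"])).fit()
print(model.summary().tables[1])  # coefficient on lagged sentiment
```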
https://arxiv.org/abs/2405.07010
In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely used low-level data analysis tasks on 7 chart types. First, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, including 12 open-source models and 6 closed-source models. Starting with a standard textual prompt approach, the average accuracy rate across the 18 MLLMs is 36.17%. Among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models in low-level data analysis tasks, we design various experiments to conduct an in-depth test of the capabilities of GPT-4V. We further investigate how visual modifications to charts, such as altering visual elements (e.g., changing color schemes) and introducing perturbations (e.g., adding image noise), affect the performance of GPT-4V. Second, we present 12 experimental findings. These findings suggest the potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and the capabilities of GPT-4V. Third, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, resulting in an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs the attention of GPT-4V to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.
https://arxiv.org/abs/2405.07001
While nationality is a pivotal demographic element that enhances the performance of language models, it has received far less scrutiny regarding inherent biases. This study investigates nationality bias in ChatGPT (GPT-3.5), a large language model (LLM) designed for text generation. The research covers 195 countries, 4 temperature settings, and 3 distinct prompt types, generating 4,680 discourses about nationality descriptions in Chinese and English. Automated metrics were used to analyze the nationality bias, and expert annotators alongside ChatGPT itself evaluated the perceived bias. The results show that ChatGPT's generated discourses are predominantly positive, especially compared to its predecessor, GPT-2. However, when prompted with negative inclinations, it occasionally produces negative content. Despite ChatGPT considering its generated text as neutral, it shows consistent self-awareness about nationality bias when subjected to the same pair-wise comparison annotation framework used by human annotators. In conclusion, while ChatGPT's generated texts seem friendly and positive, they reflect the inherent nationality biases in the real world. This bias may vary across different language versions of ChatGPT, indicating diverse cultural perspectives. The study highlights the subtle and pervasive nature of biases within LLMs, emphasizing the need for further scrutiny.
https://arxiv.org/abs/2405.06996
Large Language Models (LLMs) are promising analytical tools. They can augment human epistemic, cognitive and reasoning abilities, and support 'sensemaking', making sense of a complex environment or subject by analysing large volumes of data with a sensitivity to context and nuance absent in earlier text processing systems. This paper presents a pilot experiment that explores how LLMs can support thematic analysis of controversial topics. We compare how human researchers and two LLMs, GPT-4 and Llama 2, categorise excerpts from media coverage of the controversial Australian Robodebt scandal. Our findings highlight intriguing overlaps and variances in thematic categorisation between human and machine agents, and suggest where LLMs can be effective in supporting forms of discourse and thematic analysis. We argue LLMs should be used to augment, and not replace, human interpretation, and we add further methodological insights and reflections to existing research on the application of automation to qualitative research methods. We also introduce a novel card-based design toolkit, for both researchers and practitioners, to further interrogate LLMs as analytical tools.
https://arxiv.org/abs/2405.06919
Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model the related events at arbitrary distances. Secondly, we utilize cluster summarization to simplify and highlight important text content of clusters for mitigating information redundancy and event distance. We have conducted extensive experiments on both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.
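A schematic of the compression-then-extraction paradigm, with sklearn clustering standing in for the paper's document clustering and with the summarizer and relation extractor left as injectable stand-ins (the paper uses models such as RoBERTa, ChatGPT, and GPT-4 for those stages):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def compress_then_extract(sentences, summarize, extract_relations, n_clusters=2):
    """Cluster the document, summarize each cluster (intra-cluster view),
    then extract event relations over the summaries (inter-cluster view)."""
    X = TfidfVectorizer().fit_transform(sentences).toarray()
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    clusters = [[s for s, lab in zip(sentences, labels) if lab == c]
                for c in range(n_clusters)]
    summaries = [summarize(" ".join(c)) for c in clusters]
    return extract_relations(summaries)

# Toy run with identity stand-ins for the summarizer and extractor.
doc = ["The earthquake struck at dawn.", "Rescue teams arrived hours later.",
       "Aftershocks continued all week.", "Aid deliveries began the next day."]
print(compress_then_extract(doc, summarize=lambda t: t,
                            extract_relations=lambda summaries: summaries))
```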
https://arxiv.org/abs/2405.06890
This paper explores the possibilities of the current generation of Large Language Models for incorporating Machine Learning Operations (MLOps) functionalities into ML training code bases. We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on the automated accomplishment of various MLOps functionalities in different settings. We perform a benchmarking study that assesses the ability of these models to: (1) adapt existing code samples (Inlining) with component-specific MLOps functionality such as MLflow and Weights & Biases for experiment tracking, Optuna for hyperparameter optimization, etc., and (2) perform the task of Translation from one component of an MLOps functionality to another, e.g., translating existing GitPython-based version control code to code based on the Data Version Control library. We also propose three different approaches that involve teaching LLMs to comprehend the API documentation of the components as a reference while accomplishing the Translation tasks. In our evaluations, the gpt-3.5-turbo model significantly outperforms WizardCoder, achieving impressive Pass@3 accuracy in model optimization (55% compared to 0% by WizardCoder), experiment tracking (100% compared to 62.5% by WizardCoder), model registration (92% compared to 42% by WizardCoder) and hyperparameter optimization (83% compared to 58% by WizardCoder) on average, in their best possible settings, showcasing its superior code adaptability performance in complex MLOps tasks.
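Since results are reported as Pass@3, it may help to recall the standard unbiased pass@k estimator (Chen et al., 2021), sketched below; whether the paper computes Pass@3 in exactly this way is not stated in the abstract.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations (c of them correct), passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 3 of 10 sampled code adaptations pass the functional check
print(f"Pass@3 = {pass_at_k(n=10, c=3, k=3):.2f}")
```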
https://arxiv.org/abs/2405.06835
This study explores the innovative use of Large Language Models (LLMs) as analytical tools for interpreting complex financial regulations. The primary objective is to design effective prompts that guide LLMs in distilling verbose and intricate regulatory texts, such as the Basel III capital requirement regulations, into a concise mathematical framework that can be subsequently translated into actionable code. This novel approach aims to streamline the implementation of regulatory mandates within the financial reporting and risk management systems of global banking institutions. A case study was conducted to assess the performance of various LLMs, demonstrating that GPT-4 outperforms other models in processing and collecting necessary information, as well as executing mathematical calculations. The case study utilized numerical simulations with asset holdings -- including fixed income, equities, currency pairs, and commodities -- to demonstrate how LLMs can effectively implement the Basel III capital adequacy requirements.
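For a flavor of the "actionable code" target, here is a toy capital-adequacy calculation: risk-weighted assets as exposure times risk weight, with a CET1 ratio checked against the Basel III 4.5% minimum. The risk weights below are illustrative placeholders; actual Basel III weights depend on exposure class, rating, and approach.

```python
def risk_weighted_assets(holdings) -> float:
    """RWA = sum over positions of exposure x risk weight."""
    return sum(exposure * weight for _, exposure, weight in holdings)

holdings = [  # (asset class, exposure, illustrative risk weight)
    ("aaa_sovereign_bond", 1_000_000, 0.00),
    ("unrated_corporate_bond", 2_000_000, 1.00),
    ("listed_equity", 500_000, 1.00),
]
cet1_capital = 200_000
rwa = risk_weighted_assets(holdings)
print(f"RWA = {rwa:,.0f}; CET1 ratio = {cet1_capital / rwa:.1%} "
      f"(Basel III CET1 minimum before buffers: 4.5%)")
```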
https://arxiv.org/abs/2405.06808
In today's digital landscape, where cyber attacks have become the norm, the detection of cyber attacks and threats is critically imperative across diverse domains. Our research presents a new empirical framework for cyber threat modeling, adept at parsing and categorizing cyber-related information from news articles, enhancing real-time vigilance for market stakeholders. At the core of this framework is a fine-tuned BERT model, which we call CANAL - Cyber Activity News Alerting Language Model - tailored for cyber categorization using a novel silver-labeling approach powered by Random Forest. We benchmark CANAL against larger, costlier LLMs, including GPT-4, LLaMA, and Zephyr, highlighting their zero- to few-shot learning performance in cyber news classification. CANAL demonstrates superior performance, outperforming all other LLM counterparts in both accuracy and cost-effectiveness. Furthermore, we introduce the Cyber Signal Discovery module, a strategic component designed to efficiently detect emerging cyber signals from news articles. Collectively, CANAL and the Cyber Signal Discovery module equip our framework to provide a robust and cost-effective solution for businesses that require agile responses to cyber intelligence.
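The silver-labeling step can be sketched in a few lines of scikit-learn: a Random Forest trained on a small seed set labels a larger unlabeled corpus, and those silver labels would then feed BERT fine-tuning. The seed texts and TF-IDF features here are toy assumptions, not the paper's data or feature set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# A handful of hand-labeled headlines seed the silver labeler (toy data).
seed_texts = ["Ransomware cripples hospital network",
              "Quarterly earnings beat expectations",
              "Data breach exposes customer records",
              "New product launch announced"]
seed_labels = ["cyber", "non-cyber", "cyber", "non-cyber"]

silver_labeler = make_pipeline(TfidfVectorizer(),
                               RandomForestClassifier(random_state=0))
silver_labeler.fit(seed_texts, seed_labels)

# Silver-label an unlabeled corpus; confident predictions would then serve
# as training data for fine-tuning the BERT-based CANAL model.
unlabeled = ["Phishing campaign targets bank employees",
             "Stock markets rally on rate cut hopes"]
print(list(zip(unlabeled, silver_labeler.predict(unlabeled))))
```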
https://arxiv.org/abs/2405.06772
Aligning Large Language Models (LLMs) to cater to different human preferences, learning new skills, and unlearning harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant, but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient, but performs worse due to the optimization challenges in co-training the value function and the policy. We present a new framework for reward optimization, Value Augmented Sampling (VAS), that can maximize different reward functions using data sampled from only the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, making the optimization stable, outperforming established baselines, such as PPO and DPO, on standard benchmarks, and achieving comparable results to Best-of-128 with lower inference cost. Unlike existing RL methods that require changing the weights of the LLM, VAS does not require access to the weights of the pre-trained LLM. Thus, it can even adapt LLMs (e.g., ChatGPT), which are available only as APIs. In addition, our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time, paving the road ahead for the future of aligned, personalized LLMs.
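One reading of the decoding-time idea, sketched below, is to re-rank the frozen LLM's top-k candidate tokens by adding a learned value estimate to their logits; how VAS actually combines logits and values, and how the value function is trained, follow the paper rather than this toy.

```python
import torch

def vas_step(base_logits, value_fn, state, beta=1.0, top_k=20):
    """One decoding step: score the frozen LLM's top-k candidate tokens by
    logit + beta * V(state, token), then sample from the re-weighted set."""
    topk = torch.topk(base_logits, top_k)
    values = torch.tensor([value_fn(state, int(t)) for t in topk.indices])
    probs = torch.softmax(topk.values + beta * values, dim=-1)
    return int(topk.indices[torch.multinomial(probs, 1)])

# Toy usage: a random "frozen LLM" distribution and a value function that
# strongly favors token 7 (standing in for a reward-aligned value estimate).
torch.manual_seed(0)
logits = torch.randn(50)
print(vas_step(logits, value_fn=lambda s, t: 10.0 * (t == 7), state=None))
```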
https://arxiv.org/abs/2405.06639
We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We evaluate the Vision Language Models (VLMs) on 5 tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are structured to be easy for a human who understands the underlying graph theoretic concepts, and can all be solved by counting the appropriate elements in graphs. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.
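All of the proposed tasks have exact graph-theoretic answers, so ground truth is cheap to generate; below is a sketch with networkx for two of the task families (the benchmark's actual graph generator and renderer are not specified in the abstract).

```python
import networkx as nx

G = nx.gnp_random_graph(n=10, p=0.2, seed=1)  # a small random graph

# Exact answers for two of the task families:
max_degree_node, max_degree = max(G.degree, key=lambda nd: nd[1])
n_components = nx.number_connected_components(G)
print(f"max-degree node: {max_degree_node} (degree {max_degree}); "
      f"components: {n_components}")

# The image shown to the VLM would come from a rendering step such as
# nx.draw(G, with_labels=True) followed by saving the figure.
```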
https://arxiv.org/abs/2405.06634
Reference path following is a key component in the functioning of almost all engineered autonomous agents. Among the several path-following guidance methods in the existing literature, the vector-field-based guidance approach has received wide attention because of its simplicity and guarantee of stability under a broad class of scenarios. However, the use of the same cross-track-error-dependent structure for the desired vector field in most of the existing literature, irrespective of the instantaneous cross-track error and course angle of the unmanned vehicle, makes it quite restrictive in attaining faster convergence and also leads to infeasibly high turn-rate commands in many scenarios. To this end, this paper presents a novel switched vector-field-based guidance for following a general reference path, in which the structure of the desired vector field depends on the instantaneous cross-track error and the vehicle's course angle. While the developed method ensures faster convergence, it also ensures that the guidance command always stays within a realistic threshold satisfying its curvature constraint, thus making it more implementable in real life for autonomous vehicles with kinodynamic constraints. A theoretical analysis of the convergence of the developed guidance scheme is presented. Possibilities of undesirable chattering at phase transitions are also eliminated. Numerical simulation studies validate the satisfactory performance of the developed algorithm.
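For context, the classical (non-switched) cross-track-error vector-field law that this work generalizes has the familiar form chi_d = chi_path - chi_inf * (2/pi) * arctan(k * e); a sketch is below. The paper's contribution, switching the field's structure based on the instantaneous cross-track error and course angle, is not reproduced here.

```python
import numpy as np

def vf_course_command(e, chi_path, chi_inf=np.pi / 2, k=0.05):
    """Classical straight-line vector-field guidance: the desired course
    blends from chi_path +/- chi_inf far from the path down to chi_path
    as the cross-track error e shrinks to zero."""
    return chi_path - chi_inf * (2.0 / np.pi) * np.arctan(k * e)

for e in [200.0, 50.0, 0.0]:  # cross-track error in metres
    chi_d = vf_course_command(e, chi_path=0.0)
    print(f"e = {e:6.1f} m -> desired course {np.degrees(chi_d):6.1f} deg")
```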
https://arxiv.org/abs/2405.06355
The rapid advancements in generative artificial intelligence have opened up new avenues for enhancing various aspects of research, including the design and evaluation of survey questionnaires. However, the recent pioneering applications have not considered questionnaire pretesting. This article explores the use of GPT models as a useful tool for pretesting survey questionnaires, particularly in the early stages of survey design. Illustrated with two applications, the article suggests incorporating GPT feedback as an additional stage before human pretesting, potentially reducing successive iterations. The article also emphasizes the indispensable role of researchers' judgment in interpreting and implementing AI-generated feedback.
https://arxiv.org/abs/2405.06329