In this study, we introduce Generative Manufacturing Systems (GMS) as a novel approach to effectively manage and coordinate autonomous manufacturing assets, thereby enhancing their responsiveness and flexibility to address a wide array of production objectives and human preferences. Deviating from traditional explicit modeling, GMS employs generative AI, including diffusion models and ChatGPT, for implicit learning from envisioned futures, marking a shift from model-optimum to training-sampling decision-making. Through the integration of generative AI, GMS enables complex decision-making through interactive dialogue with humans, allowing manufacturing assets to generate multiple high-quality global decisions that can be iteratively refined based on human feedback. Empirical findings showcase GMS's substantial improvement in system resilience and responsiveness to uncertainties, with decision times reduced from seconds to milliseconds. The study underscores the inherent creativity and diversity in the generated solutions, facilitating human-centric decision-making through seamless and continuous human-machine interactions.
https://arxiv.org/abs/2405.00958
Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM) shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM, on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert - the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.
https://arxiv.org/abs/2405.00876
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings. WorkBench is publicly available as a free resource at this https URL.
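The outcome-centric evaluation described above can be sketched as a state comparison: a task passes only if the database ends up in exactly the expected state, with no unintended side effects. All function and field names below are hypothetical illustrations, not the WorkBench API.

```python
def outcome_centric_eval(db_before: dict, db_after: dict, expected_changes: dict) -> bool:
    """Judge a task solely by its effect on database state.

    A task passes only if every expected key changed to its expected value
    and nothing else changed -- a simplified sketch of outcome-centric
    evaluation, not the benchmark's actual checker.
    """
    for key, value in expected_changes.items():
        if db_after.get(key) != value:
            return False  # required change missing or wrong
    for key, value in db_after.items():
        if key not in expected_changes and db_before.get(key) != value:
            return False  # unintended side effect (e.g., email to wrong person)
    return True

before = {"email_outbox": [], "calendar": []}
ok = outcome_centric_eval(before, {"email_outbox": ["to:alice"], "calendar": []},
                          {"email_outbox": ["to:alice"]})
bad = outcome_centric_eval(before, {"email_outbox": ["to:bob"], "calendar": []},
                           {"email_outbox": ["to:alice"]})
```

Because the correct outcome is unique and unambiguous, this check needs no LLM judge, which is what makes the evaluation robust and fully automated.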
https://arxiv.org/abs/2405.00823
Customer service is how companies interface with their customers. It can contribute heavily towards the overall customer satisfaction. However, high-quality service can become expensive, creating an incentive to make it as cost efficient as possible and prompting most companies to utilize AI-powered assistants, or "chat bots". On the other hand, human-to-human interaction is still desired by customers, especially when it comes to complex scenarios such as disputes and sensitive topics like bill payment. This raises the bar for customer service agents. They need to accurately understand the customer's question or concern, identify a solution that is acceptable yet feasible (and within the company's policy), all while handling multiple conversations at once. In this work, we introduce "Ask Me Anything" (AMA) as an add-on feature to an agent-facing customer service interface. AMA allows agents to ask questions to a large language model (LLM) on demand, as they are handling customer conversations -- the LLM provides accurate responses in real-time, reducing the amount of context switching the agent needs. In our internal experiments, we find that agents using AMA versus a traditional search experience spend approximately 10% fewer seconds per conversation containing a search, translating to millions of dollars of savings annually. Agents that used the AMA feature provided positive feedback nearly 80% of the time, demonstrating its usefulness as an AI-assisted feature for customer care.
https://arxiv.org/abs/2405.00801
Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed \textit{Self-Play Preference Optimization} (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.
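As a hedged sketch of the self-play scheme (one plausible reading of the abstract; the paper's exact constants and estimators may differ), each iteration re-weights the policy by its win probability against itself, and the update is approximated with a squared-error objective over log-ratios:

```latex
% Multiplicative-weights step toward the Nash equilibrium policy
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,
  \exp\!\bigl(\eta\, P(y \succ \pi_t \mid x)\bigr)

% Practical approximation: push log-ratios toward centered win rates
\theta_{t+1} = \arg\min_{\theta}\;
  \mathbb{E}_{x,\; y \sim \pi_t}\!\left[
    \left( \log\frac{\pi_\theta(y \mid x)}{\pi_t(y \mid x)}
      \;-\; \eta\,\Bigl(P(y \succ \pi_t \mid x) - \tfrac{1}{2}\Bigr) \right)^{2}
  \right]
```

Because the target is an absolute, per-response win rate rather than a pairwise difference, the objective can raise the chosen response's likelihood and lower the rejected one's independently, which a symmetric pairwise loss cannot guarantee.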
https://arxiv.org/abs/2405.00675
Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
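A minimal sketch of the kind of confidence analysis the study describes: compare the probability each model assigns to the true label before and after quantization, and bucket the drop by the full-precision model's confidence. Function names, the 0.5 threshold, and the toy logits are illustrative assumptions, not the paper's protocol.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def true_label_confidence(logits, label):
    """Probability the model assigns to the correct class."""
    return softmax(logits)[label]

def confidence_drop_by_bucket(full_logits, quant_logits, labels, threshold=0.5):
    """Average true-label confidence drop after quantization, split by
    whether the full-precision model was already unconfident (< threshold).
    Mirrors the claim that quantization disproportionately hurts samples
    the full model was unsure about in the first place."""
    low, high = [], []
    for fl, ql, y in zip(full_logits, quant_logits, labels):
        c_full = true_label_confidence(fl, y)
        drop = c_full - true_label_confidence(ql, y)
        (low if c_full < threshold else high).append(drop)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(low), avg(high)

full = [[2.0, 0.0], [0.0, 0.1]]    # toy per-sample logits, full precision
quant = [[1.5, 0.0], [-0.5, 0.1]]  # same samples after (pretend) 4-bit GPTQ
low_drop, high_drop = confidence_drop_by_bucket(full, quant, labels=[0, 0])
```

On these toy numbers the low-confidence bucket loses more confidence than the high-confidence one, illustrating the paper's proposed explanation of quantization loss.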
https://arxiv.org/abs/2405.00632
Recent advances in deep learning have promoted the advent of many computational systems capable of performing intelligent actions that, until then, were restricted to the human intellect. In the particular case of human languages, these advances allowed the introduction of applications like ChatGPT that are capable of generating coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. Associated with these advances, concerns about copyright and data privacy infringements caused by these applications have emerged. Despite these concerns, the pace at which new natural language processing applications continued to be developed largely outpaced the introduction of new regulations. Today, communication barriers between legal experts and computer scientists lead to many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team intends to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases, while highlighting the Portuguese legislation that may apply during their development.
https://arxiv.org/abs/2405.00536
There has been growing interest in audio-language retrieval research, where the objective is to establish the correlation between audio and text modalities. However, most audio-text paired datasets often lack rich expression of the text data compared to the audio samples. One of the significant challenges facing audio-text datasets is the presence of similar or identical captions despite different audio samples. Therefore, under many-to-one mapping conditions, audio-text datasets lead to poor performance of retrieval tasks. In this paper, we propose a novel approach to tackle the data imbalance problem in the audio-language retrieval task. To overcome the limitation, we introduce a method that employs a distance sampling-based paraphraser leveraging ChatGPT, utilizing a distance function to generate a controllable distribution of manipulated text data. For a set of sentences with the same context, the distance is used to calculate a degree of manipulation for any two sentences, and ChatGPT's few-shot prompting is performed using a text cluster with a similar distance defined by the Jaccard similarity. Therefore, ChatGPT, when applied to few-shot prompting with text clusters, can adjust the diversity of the manipulated text based on the distance. The proposed approach is shown to significantly enhance performance in audio-text retrieval, outperforming conventional text augmentation techniques.
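Since the abstract defines the manipulation distance via Jaccard similarity, the core distance computation and cluster selection can be sketched as follows. Function names and thresholds are hypothetical; the actual system feeds such clusters into ChatGPT few-shot prompts.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity over word sets: 0 for identical sentences,
    1 for sentences sharing no words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def cluster_by_distance(anchor: str, candidates: list, low: float, high: float) -> list:
    """Keep paraphrases whose distance to the anchor falls in [low, high],
    i.e., a controllable degree of manipulation (names are hypothetical)."""
    return [c for c in candidates if low <= jaccard_distance(anchor, c) <= high]
```

Sampling clusters at a chosen distance band is what lets the paraphraser dial the diversity of the generated captions up or down.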
https://arxiv.org/abs/2405.00367
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r^2=0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
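The reported Spearman's r^2 = 0.32 is a rank correlation between each model's probability of regenerating GSM8k examples and its GSM8k-GSM1k gap. A self-contained rank-correlation sketch follows (average ranks for ties is an implementation choice on my part, not stated in the abstract):

```python
def ranks(xs):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Squaring the returned coefficient gives the r^2 figure quoted in the abstract; a positive value indicates that models more likely to emit GSM8k verbatim also lose more accuracy on GSM1k.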
https://arxiv.org/abs/2405.00332
Learning never ends, and there is no age limit to grow yourself. However, the educational landscape may face challenges in effectively catering to students' inclusion and diverse learning needs. These students should have access to state-of-the-art methods for lecture delivery, online resources, and technology needs. However, with all the diverse learning sources, it becomes harder for students to comprehend a large amount of knowledge in a short period of time. Traditional assistive technologies and learning aids often lack the dynamic adaptability required for individualized education plans. Large Language Models (LLMs) have been used in language translation, text summarization, and content generation applications. With rapid growth in AI over the past years, AI-powered chatbots and virtual assistants have been developed. This research aims to bridge this gap by introducing an innovative study buddy we will be calling the 'SAMCares'. The system leverages a Large Language Model (LLM) (in our case, LLaMa-2 70B as the base model) and Retrieval-Augmented Generation (RAG) to offer real-time, context-aware, and adaptive educational support. The context of the model will be limited to the knowledge base of Sam Houston State University (SHSU) course notes. The LLM component enables a chat-like environment for interacting with the system to meet the unique learning requirements of each student. For this, we will build a custom web-based GUI. At the same time, RAG enhances real-time information retrieval and text generation, in turn providing more accurate and context-specific assistance. An option to upload additional study materials in the web GUI is added in case additional knowledge support is required. The system's efficacy will be evaluated through controlled trials and iterative feedback mechanisms.
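A minimal sketch of the retrieval step in such a RAG setup: score course-note chunks against the student query and stuff the best matches into the prompt. Bag-of-words cosine stands in for the real embedding retriever, and all names are hypothetical, not the SAMCares implementation.

```python
from collections import Counter
import math

def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, notes: list, k: int = 1) -> list:
    """Return the k course-note chunks most similar to the query
    (a toy stand-in for an embedding-based retriever)."""
    return sorted(notes, key=lambda n: cosine_bow(query, n), reverse=True)[:k]

def build_prompt(query: str, notes: list) -> str:
    """Assemble the grounded prompt the LLM would answer from."""
    context = "\n".join(retrieve(query, notes))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Limiting the context to retrieved course notes is what keeps the assistant's answers grounded in SHSU material rather than the base model's general knowledge.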
https://arxiv.org/abs/2405.00330
Teaching programming in early childhood (4-9) to enhance computational thinking has gained popularity in the recent movement of computer science for all. However, current practices ignore some fundamental issues resulting from young children's developmental readiness, such as the capacity for sustained keyboarding, the decomposition of complex tasks into small tasks, the need for intuitive mapping from abstract programming to tangible outcomes, and the limited amount of screen time exposure. To address these issues in this paper, we present a novel methodology with an AI-powered integration platform to effectively teach computational thinking to young children. The system features a hybrid pedagogy that supports both the top-down and bottom-up approach for teaching computational thinking. Young children can describe their desired task in natural language, while the system can respond with an easy-to-understand program consisting of the right level of decomposed sub-tasks. A tangible robot can immediately execute the decomposed program and demonstrate the program's outcomes to young children. The system is equipped with an intelligent chatbot that can interact with young children through natural languages, and children can speak to the chatbot to complete all the needed programming tasks, while the chatbot orchestrates the execution of the program onto the robot. This completely eliminates the need for keyboards when young children program. By developing such a system, we aim to make the concept of computational thinking more accessible to young children, fostering a natural understanding of programming concepts without the need of explicit programming skills. Through the interactive experience provided by the robotic agent, our system seeks to engage children in an effective manner, contributing to the field of educational technology for early childhood computer science education.
https://arxiv.org/abs/2405.00750
Automated explanatory feedback systems play a crucial role in facilitating learning for a large cohort of learners by offering feedback that incorporates explanations, significantly enhancing the learning process. However, delivering such explanatory feedback in real-time poses challenges, particularly when high classification accuracy for domain-specific, nuanced responses is essential. Our study leverages the capabilities of large language models, specifically Generative Pre-Trained Transformers (GPT), to explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback within a tutor training dataset. Our aim is to equip tutors with actionable, explanatory feedback during online training lessons. To investigate the potential of GPT models for providing the explanatory feedback, we employed two commonly-used approaches: prompting and fine-tuning. To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score. Our findings demonstrate that: (1) the M-IoU score effectively correlates with human judgment in evaluating sequence quality; (2) using two-shot prompting on GPT-3.5 resulted in decent performance in recognizing effort-based (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68); and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, aligning with the satisfaction levels evaluated by human coders. Our results show promise for using GPT models to provide feedback that focuses on specific elements in their open-ended responses that are desirable or could use improvement.
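The M-IoU score measures overlap between model-highlighted and human-annotated praise spans. The abstract does not give the exact modification, so the sketch below computes plain IoU over token indices as one plausible base form; the span representation and function names are assumptions.

```python
def span_to_tokens(spans):
    """Flatten (start, end) token spans into a set of token indices."""
    toks = set()
    for start, end in spans:
        toks.update(range(start, end))
    return toks

def m_iou(pred_spans, gold_spans):
    """Token-level intersection-over-union between predicted and gold
    highlighted praise spans. The paper's exact modification is not
    specified in the abstract; this is the unmodified IoU it builds on."""
    p, g = span_to_tokens(pred_spans), span_to_tokens(gold_spans)
    if not p and not g:
        return 1.0  # both empty: treat as perfect agreement
    return len(p & g) / len(p | g)
```

A score of 1.0 means the model highlighted exactly the tokens the human coder did; partial overlaps score proportionally, which is what lets the metric correlate with graded human judgments.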
https://arxiv.org/abs/2405.00291
Large language models (LLMs) have proved to be very powerful on many NLP tasks. However, there are still many ways to attack such models at very low cost, so defending them becomes an important problem. In our work, we treat adversarial attack results as a new (unseen) domain of the model, and we frame the defense problem as improving the robustness of the model on this new domain. We focus on the task of conversation entailment, where multi-turn natural language dialogues are the premise, and a transformer model is fine-tuned to predict whether a given hypothesis about the given dialogue is true or false. The adversary attacks the hypothesis to fool the model into making wrong predictions. We apply synonym-swapping as the attack method. To show the robustness of the model, we implement several fine-tuning strategies and propose an embedding perturbation loss as a method to improve the robustness of the model. Finally, we show the importance of our work by discussing adversarial attacks in NLP in the real world.
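The synonym-swapping attack can be sketched as seeded random substitution over a synonym table. The toy dictionary below is an assumption for illustration; real attacks typically draw candidates from WordNet or embedding neighborhoods.

```python
import random

SYNONYMS = {  # toy table; a real attack would use WordNet or embeddings
    "movie": ["film"], "good": ["fine", "decent"], "said": ["stated"],
}

def synonym_swap(hypothesis: str, rate: float = 0.5, seed: int = 0) -> str:
    """Randomly replace words with synonyms to perturb a hypothesis
    while (ideally) preserving its truth value -- a sketch of the
    synonym-swapping attack, not the paper's exact procedure."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    out = []
    for w in hypothesis.split():
        if w in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[w]))
        else:
            out.append(w)
    return " ".join(out)
```

Because the perturbed hypothesis keeps its label, any prediction flip it causes is a robustness failure, which is exactly what the defense (fine-tuning plus the embedding perturbation loss) is meant to prevent.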
https://arxiv.org/abs/2405.00289
Unlike traditional educational chatbots that rely on pre-programmed responses, large-language model-driven chatbots, such as ChatGPT, demonstrate remarkable versatility and have the potential to serve as a dynamic resource for addressing student needs from understanding advanced concepts to solving complex problems. This work explores the impact of such technology on student learning in an interdisciplinary, project-oriented data visualization course. Throughout the semester, students engaged with ChatGPT across four distinct projects, including data visualizations and implementing them using a variety of tools including Tableau, D3, and Vega-lite. We collected conversation logs and reflection surveys from the students after each assignment. In addition, we conducted interviews with selected students to gain deeper insights into their overall experiences with ChatGPT. Our analysis examined the advantages and barriers of using ChatGPT, students' querying behavior, the types of assistance sought, and its impact on assignment outcomes and engagement. Based on the findings, we discuss design considerations for an educational solution that goes beyond the basic interface of ChatGPT, specifically tailored for data visualization education.
https://arxiv.org/abs/2405.00748
Non-cognitive skills are crucial for personal and social life well-being, and such skill development can be supported by narrative-based (e.g., storytelling) technologies. While generative AI enables interactive and role-playing storytelling, little is known about how users engage with and perceive the use of AI in social life simulation for non-cognitive skills learning. To this end, we introduced SimuLife++, an interactive platform enabled by a large language model (LLM). The system allows users to act as protagonists, creating stories with one or multiple AI-based characters in diverse social scenarios. In particular, we expanded the Human-AI interaction to a Human-AI-AI collaboration by including a sage agent, who acts as a bystander to provide users with more insightful perspectives on their choices and conversations. Through a within-subject user study, we found that the inclusion of the sage agent significantly enhanced narrative immersion, according to the narrative transportation scale, leading to more messages, particularly in group chats. Participants' interactions with the sage agent were also associated with significantly higher scores in their perceived motivation, self-perceptions, and resilience and coping, indicating positive impacts on non-cognitive skills reflection. Participants' interview results further explained the sage agent's aid in decision-making, solving ethical dilemmas, and problem-solving; on the other hand, they suggested improvements in user control and balanced responses from multiple characters. We provide design implications on the application of generative AI in narrative solutions for non-cognitive skill development in broader social contexts.
https://arxiv.org/abs/2405.00273
Underwater robots play a crucial role in exploring aquatic environments. The ability to flexibly adjust their attitudes is essential for underwater robots to effectively accomplish tasks in confined space. However, the highly coupled six degrees of freedom dynamics resulting from attitude changes and the complex turbulence within limited spatial areas present significant challenges. To address the problem of attitude control of underwater robots, this letter investigates large-range pitch angle tracking during station holding as well as simultaneous roll and yaw angle control to enable versatile attitude adjustments. Based on dynamic modeling, this letter proposes an adaptive integral sliding mode controller (AISMC) that integrates an integral module into traditional sliding mode control (SMC) and adaptively adjusts the switching gain for improved tracking accuracy, reduced chattering, and enhanced robustness. The stability of the closed-loop control system is established through Lyapunov analysis. Extensive experiments and comparison studies are conducted using a commercial remotely operated vehicle (ROV), the results of which demonstrate that AISMC achieves satisfactory performance in attitude tracking control in confined space with unknown disturbances, significantly outperforming both PID and SMC.
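One common form consistent with the abstract's description, shown here as a hedged sketch (an integral term in the sliding surface plus an adaptively tuned switching gain; the letter's exact surface and adaptation law may differ):

```latex
% Integral sliding surface on the attitude tracking error e
s = \dot{e} + \lambda_1 e + \lambda_2 \int_0^{t} e \, d\tau

% Control law with adaptive switching gain \hat{K}
u = u_{\mathrm{eq}} - \hat{K}\,\operatorname{sgn}(s), \qquad
\dot{\hat{K}} = \gamma\,\lvert s \rvert, \quad \gamma > 0
```

The integral term removes steady-state error during station holding, while growing $\hat{K}$ only when $\lvert s \rvert$ is large keeps the switching gain, and hence chattering, small once the surface is reached.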
https://arxiv.org/abs/2405.00269
This paper presents a comprehensive exploration of relation extraction utilizing advanced language models, specifically Chain of Thought (CoT) and Graphical Reasoning (GRE) techniques. We demonstrate how leveraging in-context learning with GPT-3.5 can significantly enhance the extraction process, particularly through detailed example-based reasoning. Additionally, we introduce a novel graphical reasoning approach that dissects relation extraction into sequential sub-tasks, improving precision and adaptability in processing complex relational data. Our experiments, conducted on multiple datasets, including manually annotated data, show considerable improvements in performance metrics, underscoring the effectiveness of our methodologies.
https://arxiv.org/abs/2405.00216
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact-check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
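The three-step pipeline reads naturally as function composition. The sketch below wires stub callables in place of the captioning models, the tool-using LLM, and the summarizer; all names and the stub logic are illustrative, not the VFC code.

```python
def visual_fact_checker(image, propose, fact_check, summarize):
    """Training-free three-step pipeline from the abstract:
    propose captions, fact-check each with tools, summarize into one.
    `propose`, `fact_check`, and `summarize` are caller-supplied model
    calls; the stubs below stand in for real models."""
    proposals = propose(image)                            # 1) initial captions
    verdicts = [fact_check(image, c) for c in proposals]  # 2) tool-based checks
    return summarize(proposals, verdicts)                 # 3) final caption

caption = visual_fact_checker(
    image="img_001",
    propose=lambda img: ["a red car", "a red truck"],
    fact_check=lambda img, c: "car" in c,   # pretend an object detector ran
    summarize=lambda cs, vs: next(c for c, v in zip(cs, vs) if v),
)
```

Keeping the three stages as swappable callables is what makes the pipeline training-free and flexible: any open-source captioner, detector, or LLM can be slotted in without retraining.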
https://arxiv.org/abs/2404.19752
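The three-step propose/verify/caption pipeline above can be sketched as follows. Every model call here is a stub with made-up outputs; a real implementation would plug in image-to-text captioners for step 1, a detector and VQA model driven by an LLM for step 2, and an LLM summarizer for step 3.

```python
# Illustrative sketch of the VFC pipeline: propose -> verify -> caption.
# All model calls are stubbed with hypothetical outputs.

def propose_captions(image) -> list:
    """Step 1: multiple captioning models propose initial captions (stub)."""
    return ["a red car parked on a street", "a red sports car near a tree"]

def fact_check(image, captions: list) -> dict:
    """Step 2: an LLM uses detection/VQA tools to verify each caption (stub)."""
    detected = {"car", "street"}  # pretend object-detector output
    checks = {}
    for cap in captions:
        mentioned = {w for w in cap.split() if w in {"car", "street", "tree"}}
        checks[cap] = mentioned <= detected  # every mentioned object confirmed?
    return checks

def final_caption(captions: list, checks: dict) -> str:
    """Step 3: an LLM summarizes proposals and verification results (stub)."""
    verified = [c for c in captions if checks[c]]
    return verified[0] if verified else captions[0]

image = None  # placeholder for an actual image
caps = propose_captions(image)
caption = final_caption(caps, fact_check(image, caps))
```

The key design point is that hallucinated objects (here, the unverified "tree") are filtered out by the verification step before the final caption is written.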
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
https://arxiv.org/abs/2404.19737
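The architecture above, a shared trunk feeding n independent output heads, can be sketched in a few lines. This is a toy stand-in in pure Python (the vocabulary size, hidden width, and single-layer "trunk" are illustrative, not the paper's configuration); a real model would be a transformer trained with one cross-entropy loss per head.

```python
# Toy sketch of multi-token prediction: one shared trunk representation,
# n_future independent linear heads, each predicting one future token offset.
import math
import random

random.seed(0)
vocab, d_model, n_future = 50, 16, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

W_trunk = rand_matrix(d_model, d_model)          # shared model trunk
heads = [rand_matrix(d_model, vocab) for _ in range(n_future)]  # n heads

def matvec(v, W):
    """Row vector v times matrix W (rows of W match len(v))."""
    return [sum(v[i] * W[i][j] for i in range(len(v)))
            for j in range(len(W[0]))]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict_next_n(x):
    """Shared trunk output feeds each independent head: one distribution
    per future token offset t+1 .. t+n."""
    h = [math.tanh(v) for v in matvec(x, W_trunk)]
    return [softmax(matvec(h, W)) for W in heads]

x = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in hidden state
dists = predict_next_n(x)
```

At inference, only the next-token head is needed for standard decoding, while the extra heads enable speculative self-decoding, which is where the reported speedup would come from.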
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy for Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (and 88.7% with majority voting out of 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge, which outperforms other Llama-2-based models not relying on additionally sourced datasets.
https://arxiv.org/abs/2404.19733
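The modified objective described above, the standard DPO preference term plus an extra negative log-likelihood term on the winning (correct) chain, can be written out directly. The toy log-probabilities and the weighting coefficient `alpha` here are assumptions for illustration, not values from the paper.

```python
# Sketch of a DPO loss with an added NLL term on the winning sequence:
#   L = -log sigmoid(beta * margin) + alpha * NLL(winner)
# where margin is the policy-vs-reference log-ratio gap between the
# winning and losing CoT candidates.
import math

def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, alpha=1.0):
    """logp_*: policy log-probs of winner/loser; ref_logp_*: reference
    model log-probs. beta and alpha are assumed hyperparameters."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    dpo_term = -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
    nll_term = -logp_w  # the crucial extra NLL term on the winning CoT
    return dpo_term + alpha * nll_term

# Toy example: the policy favors the winning chain more than the
# reference does (positive margin).
loss = dpo_nll_loss(logp_w=-5.0, logp_l=-9.0,
                    ref_logp_w=-6.0, ref_logp_l=-8.0)
```

Without the NLL term, the preference term alone can be minimized by pushing down the loser's likelihood while the winner's also drifts down; the added term keeps probability mass on the correct reasoning chain.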