People often answer yes-no questions without explicitly saying yes, no, or similar polar keywords. Figuring out the meaning of indirect answers is challenging, even for large language models. In this paper, we investigate this problem using dialogues from multiple domains. We present new benchmarks in three diverse domains: movie scripts, tennis interviews, and airline customer service. We present an approach grounded in distant supervision and blended training to quickly adapt to a new dialogue domain. Experimental results show that our approach is never detrimental and yields F1 improvements as high as 11-34%.
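A minimal sketch of what blended training could look like, assuming it means mixing a small gold in-domain set with distantly supervised examples in every batch; the function names, the 50/50 ratio, and the toy data are illustrative, not the paper's exact recipe:

```python
import random

def blend_batches(gold, distant, blend_ratio=0.5, batch_size=16, seed=0):
    """Yield training batches that mix gold in-domain examples with
    distantly supervised ones at a fixed ratio (hypothetical recipe;
    the paper's exact blending schedule may differ)."""
    rng = random.Random(seed)
    n_gold = int(batch_size * blend_ratio)
    while True:
        batch = rng.sample(gold, min(n_gold, len(gold)))
        batch += rng.sample(distant, batch_size - len(batch))
        rng.shuffle(batch)
        yield batch

# Toy usage: each example is (question, answer, label in {yes, no, middle}).
gold = [("Did you like it?", "I fell asleep halfway through.", "no")] * 20
distant = [("Are you coming?", "Absolutely.", "yes")] * 200
first = next(blend_batches(gold, distant))
print(len(first))  # 16
```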
https://arxiv.org/abs/2404.16262
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.
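A hedged sketch of how PRISM's individualised feedback might be represented, with every rating carrying a foreign key into a participant profile; all field names below are hypothetical, chosen only to illustrate the profile-to-rating linkage the abstract describes:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    participant_id: str
    country: str              # one of the 75 countries represented
    age_group: str
    stated_preferences: dict  # e.g. {"topics": [...], "values": [...]}

@dataclass
class Rating:
    participant_id: str   # foreign key into the profile table
    conversation_id: str
    model: str            # which of the 21 LLMs produced the turn
    score: int            # fine-grained preference rating

# Joining ratings back to profiles enables personalisation analyses and
# attribution of sample artefacts to subgroups (hypothetical schema).
profiles = {p.participant_id: p for p in
            [Profile("p001", "UK", "25-34", {"topics": ["values"]})]}
ratings = [Rating("p001", "c42", "model-a", 78)]
by_country = [(profiles[r.participant_id].country, r.score) for r in ratings]
print(by_country)
```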
https://arxiv.org/abs/2404.16019
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 30 LVLMs, such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
This workshop paper presents a critical examination of the integration of Generative AI (Gen AI) into the academic writing process, focusing on the use of AI as a collaborative tool. It contrasts the performance and interaction of two AI models, Gemini and ChatGPT, through a collaborative inquiry approach where researchers engage in facilitated sessions to design prompts that elicit specific AI responses for crafting research outlines. This case study highlights the importance of prompt design, output analysis, and recognizing the AI's limitations to ensure responsible and effective AI integration in scholarly work. Preliminary findings suggest that prompt variation significantly affects output quality and reveals distinct capabilities and constraints of each model. The paper contributes to the field of Human-Computer Interaction by exploring effective prompt strategies and providing a comparative analysis of Gen AI models, ultimately aiming to enhance AI-assisted academic writing and prompt a deeper dialogue within the HCI community.
https://arxiv.org/abs/2404.16071
Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Using this approach, relevant passages are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
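A simplified sketch of a hierarchical retrieval pipeline in the spirit of Wiki-LLaVA: rank documents first, then passages within the selected documents, and prepend the survivors to the prompt. Bag-of-words cosine stands in for the paper's learned retrievers, and the toy corpus is invented:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hierarchical_retrieve(query, docs, k_docs=2, k_passages=2):
    """Two-stage retrieval: first rank whole documents by title, then
    rank passages inside the selected documents (simplified sketch)."""
    q = Counter(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: cosine(q, Counter(d["title"].lower().split())),
                    reverse=True)[:k_docs]
    passages = [p for d in ranked for p in d["passages"]]
    passages.sort(key=lambda p: cosine(q, Counter(p.lower().split())), reverse=True)
    return passages[:k_passages]

docs = [{"title": "Eiffel Tower", "passages": ["The tower is 330 m tall.", "It opened in 1889."]},
        {"title": "Louvre", "passages": ["The Louvre is a museum in Paris."]}]
context = hierarchical_retrieve("how tall is the eiffel tower", docs)
# Retrieved passages become additional context for the multimodal LLM.
print("Context: " + " ".join(context))
```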
https://arxiv.org/abs/2404.15406
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize that unlabelled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised. Using only (1) a well-defined API schema and (2) a set of unlabelled dialogues between a user and agent, we develop a novel approach for inferring turn-level annotations as latent variables using a noisy channel model. We iteratively improve these pseudo-labels with expectation-maximization (EM), and use the inferred labels to train an end-to-end dialogue agent. Evaluating our approach on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
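A skeleton of the EM loop under stated assumptions: the E-step picks, for each turn, the latent annotation that maximizes a noisy-channel-style score, and the M-step refits the scorer on the resulting pseudo-labels (in the real system this would finetune the dialogue model). The stub scoring and refitting functions below are placeholders, not the paper's implementation:

```python
def e_step(dialogues, score_fn, candidate_states):
    """E-step: pick the latent turn annotation that maximizes a
    noisy-channel score, conceptually p(utterance | state) * p(state)."""
    return [(turn, max(candidate_states, key=lambda s: score_fn(turn, s)))
            for turn in dialogues]

def em_train(dialogues, candidate_states, score_fn, refit, iters=3):
    """Alternate inferring pseudo-labels (E) and refitting the scorer (M).
    Sketch of the unsupervised recipe; `refit` stands in for finetuning
    the dialogue model on the pseudo-labels."""
    for _ in range(iters):
        pseudo = e_step(dialogues, score_fn, candidate_states)
        score_fn = refit(pseudo)
    return score_fn

# Toy instantiation: latent states are schema slots mentioned in the turn.
states = ["hotel", "restaurant"]
data = ["i need a cheap hotel", "book a table at a restaurant"]
score = lambda turn, s: turn.count(s)                   # stand-in for p(u|s)p(s)
refit = lambda pseudo: (lambda turn, s: turn.count(s))  # stand-in for finetuning
em_train(data, states, score, refit)
```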
https://arxiv.org/abs/2404.15219
This paper presents a novel exploration into the regressive side effects of training Large Language Models (LLMs) to mimic student misconceptions for personalized education. We highlight the problem that as LLMs are trained to more accurately mimic student misconceptions, there is a compromise in the factual integrity and reasoning ability of the models. Our work involved training an LLM on a student-tutor dialogue dataset to predict student responses. The results demonstrated a decrease in the model's performance across multiple benchmark datasets, including the ARC reasoning challenge and TruthfulQA, which evaluates the truthfulness of the model's generated responses. Furthermore, the HaluEval Dial dataset, used for hallucination detection, and MemoTrap, a memory-based task dataset, also reported a decline in model accuracy. To combat these side effects, we introduced a "hallucination token" technique. This token, appended at the beginning of each student response during training, instructs the model to switch between mimicking student misconceptions and providing factually accurate responses. Despite the significant improvement across all datasets, the technique does not completely restore the LLM's baseline performance, indicating the need for further research in this area. This paper contributes to the ongoing discussion on the use of LLMs for student modeling, emphasizing the need for a balance between personalized education and factual accuracy.
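A minimal sketch of the hallucination-token idea as the abstract describes it: prepend a control token to every student response during training, then include or omit it at inference to switch modes. The token string and prompt format below are assumptions, not the paper's exact choices:

```python
HALLUCINATION_TOKEN = "<hallucinate>"  # name is illustrative, not the paper's

def format_training_pair(tutor_turn: str, student_turn: str) -> str:
    """Prepend a control token to each student response so the model learns
    that misconception-style text only follows the token."""
    return f"Tutor: {tutor_turn}\nStudent: {HALLUCINATION_TOKEN} {student_turn}"

def format_inference_prompt(tutor_turn: str, mimic_student: bool) -> str:
    """At inference, include the token to elicit student-like misconceptions,
    or omit it to request factually accurate responses."""
    prefix = f"Tutor: {tutor_turn}\nStudent:"
    return f"{prefix} {HALLUCINATION_TOKEN}" if mimic_student else prefix

print(format_training_pair("What causes seasons?", "The distance to the sun."))
print(format_inference_prompt("What causes seasons?", mimic_student=False))
```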
https://arxiv.org/abs/2404.15156
Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that, rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.
https://arxiv.org/abs/2404.14901
This paper explores SynTOD, a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) Systems capable of handling complex tasks such as intent classification, slot filling, conversational question-answering, and retrieval-augmented response generation, without relying on crowdsourcing or real-world data. SynTOD utilizes a state transition graph to define the desired behavior of a TOD system and generates diverse, structured conversations through random walks and response simulation using large language models (LLMs). In our experiments, using graph-guided response simulations leads to significant improvements in intent classification, slot filling and response relevance compared to naive single-prompt simulated conversations. We also investigate the end-to-end TOD effectiveness of different base and instruction-tuned LLMs, with and without the constructed synthetic conversations. Finally, we explore how various LLMs can evaluate responses in a TOD system and how well they are correlated with human judgments. Our findings pave the way towards quick development and evaluation of domain-specific TOD systems. We release our datasets, models, and code for research purposes.
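A small sketch of the graph-guided generation step, assuming the state transition graph is a plain adjacency map: a random walk samples a conversation skeleton, which an LLM would then expand into user and system turns. The example graph and state names are invented:

```python
import random

def random_walk(graph, start="START", end="END", rng=random.Random(0)):
    """Sample one conversation skeleton by walking the state transition
    graph that defines the desired TOD behaviour (simplified sketch; the
    paper then has an LLM simulate utterances for each visited state)."""
    path, state = [], start
    while state != end:
        state = rng.choice(graph[state])
        if state != end:
            path.append(state)
    return path

# Toy transition graph for an assistant-like flow (hypothetical states).
graph = {
    "START": ["intent:find_recipe"],
    "intent:find_recipe": ["slot:cuisine", "qa:question"],
    "slot:cuisine": ["qa:question", "respond:recommend"],
    "qa:question": ["respond:recommend"],
    "respond:recommend": ["END"],
}
skeleton = random_walk(graph)
# Each state would be expanded into user/system turns by prompting an LLM.
print(skeleton)
```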
https://arxiv.org/abs/2404.14772
Large Language Models (LLMs) have demonstrated remarkable performance across a spectrum of tasks. Recently, Direct Preference Optimization (DPO) has emerged as an RL-free approach to optimize the policy model on human preferences. However, several limitations hinder the widespread adoption of this method. To address these shortcomings, various versions of DPO have been introduced. Yet, a comprehensive evaluation of these variants across diverse tasks is still lacking. In this study, we aim to bridge this gap by investigating the performance of alignment methods across three distinct scenarios: (1) keeping the Supervised Fine-Tuning (SFT) part, (2) skipping the SFT part, and (3) skipping the SFT part and utilizing an instruction-tuned model. Furthermore, we explore the impact of different training sizes on their performance. Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding, encompassing 13 benchmarks such as MT-Bench, Big Bench, and Open LLM Leaderboard. Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and that employing an instruction-tuned model notably influences truthfulness. We anticipate that our findings will catalyze further research aimed at developing more robust models to address alignment challenges.
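For reference, a numeric sketch of the standard DPO objective that these variants modify; the toy log-probabilities are invented and beta=0.1 is just a common default:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid of the
    beta-scaled difference of policy/reference log-ratios between the
    chosen (w) and rejected (l) answers. The surveyed variants modify
    this objective (e.g. dropping the reference model or the SFT stage)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen answer slightly.
print(round(dpo_loss(logp_w=-10.0, logp_l=-12.0,
                     ref_logp_w=-11.0, ref_logp_l=-11.5), 4))
```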
https://arxiv.org/abs/2404.14723
Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on test input. Following the assumption that accurate inverse inference probability (likelihood) will result in accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-task and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.
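A hedged sketch of the selection rule: score each candidate in-context example by an inverse inference likelihood computed against the test input, and keep the top scorers. In the real method this likelihood would come from an LLM queried for the candidate's own label; the stub below is a word-overlap stand-in:

```python
def bycs_select(test_input, candidates, inverse_likelihood, k=4):
    """Rank candidate in-context examples by their inverse inference score:
    roughly, how likely the candidate's label is when the model conditions
    on the test input (sketch; `inverse_likelihood` would query an LLM for
    p(candidate_label | candidate_input, test_input))."""
    scored = sorted(
        candidates,
        key=lambda ex: inverse_likelihood(ex["input"], ex["label"], test_input),
        reverse=True,
    )
    return scored[:k]

# Stub likelihood: prefer candidates whose input shares words with the test.
def stub_likelihood(cand_input, cand_label, test_input):
    return len(set(cand_input.split()) & set(test_input.split()))

pool = [{"input": "translate cat to french", "label": "chat"},
        {"input": "add 2 and 2", "label": "4"}]
print(bycs_select("translate dog to french", pool, stub_likelihood, k=1))
```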
https://arxiv.org/abs/2404.14716
When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a "principle writer" model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct). The SAMI-trained mixtral-8x7b outperforms both the initial model and the instruction-finetuned model, achieving a 65% win rate on summarization. Our results indicate that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
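A rough sketch of an InfoNCE-style lower bound on the conditional mutual information between constitutions and responses, which is one standard way to operationalize the objective the abstract names; whether SAMI uses exactly this estimator is an assumption:

```python
import math

def cmi_lower_bound(logp_matrix):
    """InfoNCE-style lower bound on the conditional mutual information
    between constitutions and responses: logp_matrix[i][j] is the model's
    log-likelihood of response j under constitution i (same query), with
    matched pairs on the diagonal. Increasing this bound pushes each
    response to be most likely under its own constitution."""
    n, total = len(logp_matrix), 0.0
    for i in range(n):
        row = logp_matrix[i]
        log_denom = math.log(sum(math.exp(v) for v in row))
        total += row[i] - log_denom  # log-softmax mass at the matched response
    return total / n

# Toy 2x2 case: each response is already most likely under its constitution.
print(round(cmi_lower_bound([[-1.0, -3.0], [-4.0, -1.5]]), 3))
```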
https://arxiv.org/abs/2404.14313
Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches like training character classifiers which require specific annotations for each comic title are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. In spite of their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, while their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.
https://arxiv.org/abs/2404.13993
Customizing persuasive conversations related to the outcome of interest for specific users achieves better persuasion results. However, existing persuasive conversation systems rely on persuasive strategies and encounter challenges in dynamically adjusting dialogues to suit the evolving states of individual users during interactions. This limitation restricts the system's ability to deliver flexible or dynamic conversations, leading to suboptimal persuasion outcomes. In this paper, we present a novel approach that tracks a user's latent personality dimensions (LPDs) during an ongoing persuasion conversation and generates tailored counterfactual utterances based on these LPDs to optimize the overall persuasion outcome. In particular, our proposed method leverages a Bi-directional Generative Adversarial Network (BiCoGAN) in tandem with a Dialogue-based Personality Prediction Regression (DPPR) model to generate counterfactual data. This enables the system to formulate alternative persuasive utterances that are more suited to the user. Subsequently, we utilize the D3QN model to learn policies for the optimized selection of system utterances on counterfactual data. Experimental results obtained on the PersuasionForGood dataset demonstrate the superiority of our approach over the existing method, BiCoGAN. The cumulative rewards and Q-values produced by our method surpass ground-truth benchmarks, showcasing the efficacy of employing counterfactual reasoning and LPDs to optimize reinforcement-learning policy in online interactions.
https://arxiv.org/abs/2404.13792
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.
https://arxiv.org/abs/2404.13289
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and, instead, the evaluation often relies on crowdsourced evaluation labels. The role of user feedback in annotators' assessment of turns in a conversation has been little studied. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated. We explore and compare two methodologies for assessing TDSs: one including the user's follow-up utterance and one without. We use both crowdworkers and large language models (LLMs) as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate that there is a distinct difference in ratings assigned by the two annotator groups in the two setups, indicating that user feedback does influence system evaluation. Workers are more susceptible to user feedback on usefulness and interestingness than LLMs are on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.
https://arxiv.org/abs/2404.12994
Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.
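A minimal sketch of one way to apply a transferred reward model zero-shot, via best-of-n reranking of target-language candidates; the stub reward function is a toy stand-in for a model trained on source-language preference data, and the paper's actual alignment procedure may differ:

```python
def best_of_n(prompt, candidates, reward_model):
    """Zero-shot cross-lingual alignment via best-of-n: a reward model
    trained on preference data in a *source* language directly scores
    candidates in the *target* language (sketch; the paper also studies
    other uses of the transferred reward model)."""
    return max(candidates, key=lambda y: reward_model(prompt, y))

# Stub reward model standing in for one trained on, say, English data.
def stub_reward(prompt, response):
    return -abs(len(response) - 40)  # toy proxy: prefer mid-length answers

prompt = "Fasse den Artikel zusammen."  # target language: German
cands = ["Kurz.", "Der Artikel beschreibt die wichtigsten Ergebnisse."]
print(best_of_n(prompt, cands, stub_reward))
```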
https://arxiv.org/abs/2404.12318
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, these responses are generated at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at this https URL.
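An illustrative decomposition, not the paper's exact loss: the sequence-level DPO log-ratio equals a sum of per-token log-ratios, and keeping that sum per token is what lets a method like TDPO attach a forward KL constraint at each position:

```python
import math

def token_forward_kl(ref_dist, pol_dist):
    """Forward KL(ref || policy) at one token position, given full
    next-token distributions (lists that each sum to 1)."""
    return sum(r * math.log(r / p) for r, p in zip(ref_dist, pol_dist) if r > 0)

def tokenwise_margin(pol_logps, ref_logps):
    """Sum of per-token log-ratios log pi(y_t|...) - log pi_ref(y_t|...);
    summed over the sequence this equals the DPO log-ratio, but keeping
    it per token allows token-level constraints (illustrative only)."""
    return sum(p - r for p, r in zip(pol_logps, ref_logps))

# Chosen answer: three tokens, policy slightly above reference everywhere.
margin_w = tokenwise_margin([-1.0, -0.5, -2.0], [-1.2, -0.8, -2.1])
kl = token_forward_kl([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(round(margin_w, 3), round(kl, 4))
```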
https://arxiv.org/abs/2404.11999
Automated dialogue systems are important applications of artificial intelligence, and traditional systems struggle to understand user emotions and provide empathetic feedback. This study integrates emotional intelligence technology into automated dialogue systems and creates a dialogue generation model with emotional intelligence through deep learning and natural language processing techniques. The model can detect and understand a wide range of emotions and specific pain signals in real time, enabling the system to provide empathetic interaction. By integrating the results of the study "Can artificial intelligence detect pain and express pain empathy?", the model's ability to understand the subtle elements of pain empathy has been enhanced, setting higher standards for emotional intelligence dialogue systems. The project aims to provide theoretical understanding and practical suggestions to integrate advanced emotional intelligence capabilities into dialogue systems, thereby improving user experience and interaction quality.
https://arxiv.org/abs/2404.11447
The advent of deep learning models has made a considerable contribution to the achievement of Emotion Recognition in Conversation (ERC). However, this task still remains an important challenge due to the plurality and subjectivity of human emotions. Previous work on ERC provides predictive models using mostly graph-based conversation representations. In this work, we propose a way to model the conversational context that we incorporate into a metric learning training strategy, with a two-step process. This allows us to perform ERC in a flexible classification scenario and to end up with a lightweight yet efficient model. Using metric learning through a Siamese Network architecture, we achieve 57.71 in macro F1 score for emotion classification in conversation on the DailyDialog dataset, which outperforms the related work. This state-of-the-art result is promising regarding the use of metric learning for emotion recognition, yet perfectible compared to the micro F1 score obtained.
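A compact sketch of the Siamese metric-learning setup, assuming the classic contrastive loss and nearest-centroid classification; the paper's two-step, context-aware training is not reproduced here:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb1, emb2, same_label, margin=1.0):
    """Classic Siamese contrastive loss: pull embeddings of same-emotion
    utterances together, push different-emotion pairs at least `margin`
    apart (sketch; the exact loss used in the paper is an assumption)."""
    d = euclid(emb1, emb2)
    return d ** 2 if same_label else max(0.0, margin - d) ** 2

def predict(query_emb, class_centroids):
    """Flexible classification: assign the nearest emotion centroid, so a
    new label set only requires new centroids, not retraining."""
    return min(class_centroids, key=lambda c: euclid(query_emb, class_centroids[c]))

centroids = {"joy": [1.0, 0.0], "anger": [-1.0, 0.0]}
print(contrastive_loss([0.9, 0.1], [1.1, -0.1], same_label=True))
print(predict([0.8, 0.2], centroids))
```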
https://arxiv.org/abs/2404.11141