As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench - the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM and LLaMA on IndicGenBench in a variety of settings. The largest PaLM-2 model performs best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed to develop more inclusive multilingual language models. IndicGenBench is released at the URL below.
https://arxiv.org/abs/2404.16816
Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge, while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing the generation quality across various tasks through in-context learning (ICL) using given examples without the need for any fine-tuning. However, the diversity aspect in LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies the LLM generations, while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between the quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.
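The quality-diversity balance above presupposes a way to measure diversity. The abstract does not name its metric; one common proxy is distinct-n, the ratio of unique n-grams to total n-grams across a set of generations. A minimal sketch:

```python
from collections import Counter

def distinct_n(sentences, n=2):
    """Ratio of unique n-grams to total n-grams across generations.

    Higher values indicate more diverse outputs. This is one common
    diversity proxy; the paper may use different measures.
    """
    total, unique = 0, Counter()
    for sent in sentences:
        tokens = sent.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

Three identical generations score 1/3 under distinct-2 (two unique bigrams out of six), while fully distinct generations score 1.0.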
https://arxiv.org/abs/2404.16807
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
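As a rough illustration of the squeeze-and-excitation idea behind the FA block (a toy NumPy sketch, not the authors' architecture): squeeze token embeddings to a per-feature summary, pass it through a bottleneck, and use a sigmoid gate to reweight each embedding dimension.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_attention(x, w1, w2):
    """Squeeze-and-excitation over embedding features (illustrative).

    x:  [seq_len, d] token embeddings
    w1: [d // r, d] bottleneck "squeeze" weights (r = reduction ratio)
    w2: [d, d // r] "excitation" weights
    Returns x with each feature dimension rescaled by a learned gate.
    """
    z = x.mean(axis=0)                            # squeeze: average over tokens
    gate = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # excitation: per-feature gate in (0, 1)
    return x * gate                               # reweight features dimension-wise
```

With zero weights the gate is sigmoid(0) = 0.5 everywhere, so every feature is uniformly halved; trained weights would instead emphasize features that contribute most to the final classification.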
https://arxiv.org/abs/2404.16776
Extracting who says what to whom is a crucial part of analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. The annotations not only specify who said what but also how, in which context, and to whom, and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.
https://arxiv.org/abs/2404.16764
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
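The quantity being estimated, WER itself, is word-level edit distance normalized by reference length. A minimal sketch of the metric (an illustrative helper, not the paper's code):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A WER estimator, by contrast, must predict this value without access to the reference transcript, which is what makes hypothesis generation from simulated ASR output useful.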
https://arxiv.org/abs/2404.16743
In the rapidly evolving field of artificial intelligence, ensuring safe decision-making of Large Language Models (LLMs) is a significant challenge. This paper introduces Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in the ability of models to manage shared resources. Furthermore, we find that when agents' ability to communicate is removed, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.
https://arxiv.org/abs/2404.16698
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have been conducted to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited as they only incorporate syntactic or lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we attempt to explore the potential of this approach. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntactic-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate this framework can consistently outperform all baselines of zero-shot cross-lingual transfer, with gains of 1.0-3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
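The code-switching technique can be illustrated with a toy augmenter that substitutes bilingual-dictionary translations for a fraction of tokens, so the model implicitly observes lexical alignments (an illustrative sketch; the paper's exact procedure may differ):

```python
import random

def code_switch(sentence, bilingual_dict, ratio=0.3, seed=0):
    """Replace a fraction of tokens with their dictionary translations.

    sentence:       whitespace-tokenized source sentence
    bilingual_dict: source word -> target-language word (toy dictionary)
    ratio:          fraction of switchable tokens to replace (at least one)
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    switchable = [i for i, t in enumerate(tokens) if t in bilingual_dict]
    k = max(1, int(len(switchable) * ratio)) if switchable else 0
    for i in rng.sample(switchable, k):
        tokens[i] = bilingual_dict[tokens[i]]
    return " ".join(tokens)
```

For example, with a one-entry English-German dictionary, "the cat sleeps" becomes "the Katze sleeps", exposing the cat/Katze alignment without parallel sentences.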
https://arxiv.org/abs/2404.16627
It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
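The probing recipe, PCA over number-token embeddings, can be sketched with a toy example (synthetic embeddings, not ALBERT's). When a numerical attribute dominates the variance, the first principal component recovers the approximate ordering:

```python
import numpy as np

def principal_components(embeddings, k=2):
    """Project embeddings onto their top-k principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered matrix yields principal directions in Vt
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Toy embeddings whose first feature grows with the number they denote;
# the first principal component should recover that ordering (up to sign).
emb = np.array([[1.0, 0.1], [2.0, 0.1], [3.0, 0.1], [4.0, 0.1]])
pc1 = principal_components(emb, k=1)[:, 0]
```

With real ALBERT embeddings, the analogous finding is that the axes of greatest variation order numerals and number words consistently, even though the two token types sit in separate clusters.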
https://arxiv.org/abs/2404.16574
Large Language Models (LLMs) offer the potential for automatic time series analysis and reporting, which is a critical task across many domains, spanning healthcare, finance, climate, energy, and many more. In this paper, we propose a framework for rigorously evaluating the capabilities of LLMs on time series understanding, encompassing both univariate and multivariate forms. We introduce a comprehensive taxonomy of time series features, a critical framework that delineates various characteristics inherent in time series data. Leveraging this taxonomy, we have systematically designed and synthesized a diverse dataset of time series, embodying the different outlined features. This dataset acts as a solid foundation for assessing the proficiency of LLMs in comprehending time series. Our experiments shed light on the strengths and limitations of state-of-the-art LLMs in time series understanding, revealing which features these models readily comprehend effectively and where they falter. In addition, we uncover the sensitivity of LLMs to factors including the formatting of the data, the position of points queried within a series and the overall time series length.
https://arxiv.org/abs/2404.16563
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recalls. We investigate the error cases and attribute the failure to different surface structures and semantics of documents translated from English and those written by native speakers. We thus switch to explore if the transferred dataset can assist human annotation on Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
https://arxiv.org/abs/2404.16506
Mental health in children and adolescents has been steadily deteriorating over the past few years [1]. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention, yet despite specifically prevalent issues such as school bullying and eating disorders, previous studies have not investigated performance in this domain or for open information extraction where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19 annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT, and compare expert labels to annotations from two top-performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement, and performance on synthetic data to be substantially higher; however, the model still occasionally errs on issues of negation and factuality, and the higher performance on synthetic data is driven by the greater complexity of real data rather than an inherent advantage.
https://arxiv.org/abs/2404.16461
Instruction tuning has shown its ability not only to enhance zero-shot generalization across various tasks but also to improve the performance of specific tasks. A crucial aspect in instruction tuning for a particular task is a strategic selection of related tasks that offer meaningful supervision, thereby enhancing efficiency and preventing performance degradation from irrelevant tasks. Our research reveals that leveraging instruction information alone enables the identification of pertinent tasks for instruction tuning. This approach is notably simpler compared to traditional methods that necessitate complex measurements of pairwise transferability between tasks or the creation of data samples for the target task. Furthermore, by additionally learning the unique instructional template style of the meta-dataset, we observe an improvement in task selection accuracy, which contributes to enhanced overall performance. Experimental results demonstrate that training on a small set of tasks, chosen solely based on the instructions, leads to substantial performance improvements on benchmarks like P3, Big-Bench, NIV2, and Big-Bench Hard. Significantly, these improvements exceed those achieved by prior task selection methods, highlighting the efficacy of our approach.
https://arxiv.org/abs/2404.16418
This paper presents a question-answering approach to extract document-level event-argument structures. We automatically ask and answer questions for each argument type an event may have. Questions are generated using manually defined templates and generative transformers. Template-based questions are generated using predefined role-specific wh-words and event triggers from the context document. Transformer-based questions are generated using large language models trained to formulate questions based on a passage and the expected answer. Additionally, we develop novel data augmentation strategies specialized in inter-sentential event-argument relations. We use a simple span-swapping technique, coreference resolution, and large language models to augment the training instances. Our approach enables transfer learning without any corpora-specific modifications and yields competitive results with the RAMS dataset. It outperforms previous work, and it is especially beneficial to extract arguments that appear in different sentences than the event trigger. We also present detailed quantitative and qualitative analyses shedding light on the most common errors made by our best model.
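The span-swapping augmentation can be illustrated with a simplified sketch that exchanges same-role argument spans between two instances (hypothetical data format; a real pipeline would also handle character offsets and coreference):

```python
def span_swap(inst_a, inst_b, role):
    """Swap the argument span filling `role` between two instances,
    rewriting each text accordingly (simplified illustration)."""
    span_a, span_b = inst_a["args"][role], inst_b["args"][role]
    new_a = {"text": inst_a["text"].replace(span_a, span_b),
             "args": {**inst_a["args"], role: span_b}}
    new_b = {"text": inst_b["text"].replace(span_b, span_a),
             "args": {**inst_b["args"], role: span_a}}
    return new_a, new_b
```

Swapping, say, the place arguments of "Alice visited Paris" and "Bob visited Tokyo" yields two new labeled instances for free, which is the appeal of span-level augmentation for low-resource argument roles.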
https://arxiv.org/abs/2404.16413
Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to only activate a subset of parameters in training and inference, Mixture-of-Experts (MoE) architectures have been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models have used complex designs such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We find that such delicate designs are not necessary: an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. To be more specific, we benchmark our proposed model on a large-scale inner-source dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B level Word Error Rate (WER) while maintaining a Dense-225M level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
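The "embarrassingly simple substitution" amounts to replacing each dense FFN with a routed mixture of expert FFNs. A minimal single-frame sketch in NumPy (toy shapes, not the Conformer implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_ffn(x, router_w, experts, top_k=2):
    """Drop-in MoE replacement for a dense FFN layer (illustrative).

    x:        [d] a single frame/token representation
    router_w: [n_experts, d] routing weights
    experts:  list of (w_in [d_ff, d], w_out [d, d_ff]) expert FFN weights
    Routes the frame to its top-k experts and mixes their outputs
    by the renormalized gate weights.
    """
    gates = softmax(router_w @ x)
    top = np.argsort(gates)[-top_k:]
    weights = gates[top] / gates[top].sum()  # renormalize over chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (w_out @ np.maximum(w_in @ x, 0.0))  # expert FFN with ReLU
    return out
```

Because only top_k experts run per frame, total parameters grow with the number of experts while per-frame compute (and hence RTF) stays near the dense baseline, which is the scaling argument the abstract makes.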
https://arxiv.org/abs/2404.16407
Our world is shaped by events of various complexity. This includes both small-scale local events like local farmer markets and large complex events like political and military conflicts. The latter are typically not observed directly but through the lenses of intermediaries like newspapers or social media. In other words, we do not witness the unfolding of such events directly but are confronted with narratives surrounding them. Such narratives capture different aspects of a complex event and may also differ with respect to the narrator. Thus, they provide a rich semantics concerning real-world events. In this paper, we show how narratives concerning complex events can be constructed and utilized. We provide a formal representation of narratives based on recursive nodes to represent multiple levels of detail and discuss how narratives can be bound to event-centric knowledge graphs. Additionally, we provide an algorithm based on incremental prompting techniques that mines such narratives from texts to account for different perspectives on complex events. Finally, we show the effectiveness and future research directions in a proof of concept.
https://arxiv.org/abs/2404.16405
Ensuring the safety alignment of Large Language Models (LLMs) is crucial to generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs are vulnerable to "jailbreaking" attacks, where carefully crafted prompts elicit them to produce toxic content. One category of jailbreak attacks reformulates the task as an adversarial attack by eliciting the LLM to generate an affirmative response. However, the typical attack in this category, GCG, has a very limited attack success rate. In this study, to better study the jailbreak attack, we introduce the DSN (Don't Say No) attack, which prompts LLMs not only to generate affirmative responses but also novelly augments the objective to suppress refusals. Another challenge in jailbreak attacks lies in evaluation, as it is difficult to directly and accurately assess the harmfulness of the attack. Existing evaluation methods such as refusal keyword matching have their own limitations, as they produce numerous false positive and false negative instances. To overcome this challenge, we propose an ensemble evaluation pipeline incorporating Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators. Extensive experiments demonstrate the potency of the DSN attack and the effectiveness of ensemble evaluation compared to baseline methods.
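The refusal-keyword-matching baseline that the ensemble pipeline is meant to replace fits in a few lines, which also makes its failure modes concrete (the marker list below is illustrative, not the paper's):

```python
REFUSAL_MARKERS = ["i cannot", "i can't", "sorry", "as an ai"]

def keyword_refusal_check(response):
    """Flag a response as a refusal if it contains a known refusal phrase.

    Fast but brittle: it misses paraphrased refusals (false negatives)
    and flags harmless apologies (false positives) -- the limitations
    an NLI-plus-LLM-judge ensemble is designed to address.
    """
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```

For instance, "Sorry for the delay, here are the steps..." is wrongly flagged as a refusal, while "That request is not something I will assist with." is wrongly passed, illustrating both error types the abstract mentions.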
https://arxiv.org/abs/2404.16369
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge. We extensively experiment with transformer models trained on multiple synthetic datasets and with different training objectives and show that while other objectives (e.g., sequence-to-sequence modeling, prefix language modeling) often failed to lead to hierarchical generalization, models trained with the language modeling objective consistently learned to generalize hierarchically. We then conduct pruning experiments to study how transformers trained with the language modeling objective encode hierarchical structure. When pruned, we find the joint existence of subnetworks within the model with different generalization behaviors (subnetworks corresponding to hierarchical structure and linear order). Finally, we take a Bayesian perspective to further uncover transformers' preference for hierarchical generalization: we establish a correlation between whether transformers generalize hierarchically on a dataset and whether the simplest explanation of that dataset is provided by a hierarchical grammar rather than by regular grammars exhibiting linear generalization.
https://arxiv.org/abs/2404.16367
Computational historical linguistics seeks to systematically understand processes of sound change, including during periods at which little to no formal recording of language is attested. At the same time, few computational resources exist which deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA's value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA's capability for enhancing other historical-linguistic datasets through a dataset compatibility study.
https://arxiv.org/abs/2404.16341
The awareness of multi-cultural human values is critical to the ability of language models (LMs) to generate safe and personalized responses. However, this awareness of LMs has been insufficiently studied, since the computer science community lacks access to the large-scale real-world data about multi-cultural values. In this paper, we present WorldValuesBench, a globally diverse, large-scale benchmark dataset for the multi-cultural value prediction task, which requires a model to generate a rating response to a value question based on demographic contexts. Our dataset is derived from an influential social science project, World Values Survey (WVS), that has collected answers to hundreds of value questions (e.g., social, economic, ethical) from 94,728 participants worldwide. We have constructed more than 20 million examples of the type "(demographic attributes, value question) → answer" from the WVS responses. We perform a case study using our dataset and show that the task is challenging for strong open and closed-source models. On merely 11.1%, 25.0%, 72.2%, and 75.0% of the questions, Alpaca-7B, Vicuna-7B-v1.5, Mixtral-8x7B-Instruct-v0.1, and GPT-3.5 Turbo can respectively achieve <0.2 Wasserstein 1-distance from the human normalized answer distributions. WorldValuesBench opens up new research avenues in studying limitations and opportunities in multi-cultural value awareness of LMs.
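For a one-dimensional ordered rating scale (with unit spacing between adjacent options), the Wasserstein 1-distance used in the evaluation reduces to the sum of absolute differences between the two cumulative distributions. A minimal sketch:

```python
def wasserstein_1(p, q):
    """W1 between two normalized distributions over the same ordered
    rating scale (unit spacing): the sum of absolute CDF differences."""
    assert len(p) == len(q)
    total, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq)
    return total
```

Unlike plain accuracy, this penalizes a model more for predicting a rating far from the human distribution than for one just off by a point, which suits ordinal survey answers.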
https://arxiv.org/abs/2404.16308
People often answer yes-no questions without explicitly saying yes, no, or similar polar keywords. Figuring out the meaning of indirect answers is challenging, even for large language models. In this paper, we investigate this problem working with dialogues from multiple domains. We present new benchmarks in three diverse domains: movie scripts, tennis interviews, and airline customer service. We present an approach grounded on distant supervision and blended training to quickly adapt to a new dialogue domain. Experimental results show that our approach is never detrimental and yields F1 improvements as high as 11-34%.
https://arxiv.org/abs/2404.16262