Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.
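A filter of this kind can be sketched with a simple dictionary-coverage score. This is a minimal illustration only, not the paper's actual statistical-language-model-plus-translation-probability filter; the dictionary, tokenization, and threshold are hypothetical:

```python
def dictionary_coverage(ja_tokens, zh_tokens, dictionary):
    """Fraction of Japanese tokens whose dictionary translation appears on the Chinese side."""
    if not ja_tokens:
        return 0.0
    zh = set(zh_tokens)
    hits = sum(1 for t in ja_tokens if zh & set(dictionary.get(t, ())))
    return hits / len(ja_tokens)

def filter_pairs(pairs, dictionary, threshold=0.3):
    """Keep sentence pairs whose coverage score clears the threshold."""
    return [(ja, zh) for ja, zh in pairs
            if dictionary_coverage(ja, zh, dictionary) >= threshold]
```

In practice a filter like the paper's would combine such translation-probability evidence with per-language fluency scores from statistical language models.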
https://arxiv.org/abs/2405.09017
We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): \acronym (LLM-Assisted Rule Based Machine Translation). Using the \acronym paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.
https://arxiv.org/abs/2405.08997
What can contemporary machine learning (ML) models do? Given the proliferation of ML models in society, answering this question matters to a variety of stakeholders, both public and private. The evaluation of models' capabilities is rapidly emerging as a key subfield of modern ML, buoyed by regulatory attention and government grants. Despite this, the notion of an ML model possessing a capability has not been interrogated: what are we saying when we say that a model is able to do something? And what sorts of evidence bear upon this question? In this paper, we aim to answer these questions, using the capabilities of large language models (LLMs) as a running example. Drawing on the large philosophical literature on abilities, we develop an account of ML models' capabilities which can be usefully applied to the nascent science of model evaluation. Our core proposal is a conditional analysis of model abilities (CAMA): crudely, a machine learning model has a capability to X just when it would reliably succeed at doing X if it 'tried'. The main contribution of the paper is making this proposal precise in the context of ML, resulting in an operationalisation of CAMA applicable to LLMs. We then put CAMA to work, showing that it can help make sense of various features of ML model evaluation practice, as well as suggest procedures for performing fair inter-model comparisons.
https://arxiv.org/abs/2405.08989
Transformer-based long context generative models power emerging AI applications like hour-long video understanding and project-level coding agents. Deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge since 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to \textit{one single source: the large size of the KV cache}. We use a 34B GPT-3.5 level model of 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much more compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing in GPU HBM substantially restricts the number of concurrent users being served; (3) during decoding, repeatedly reading the KV cache from HBM to the SMs greatly increases latency; (4) when the KV cache overflows HBM memory, swapping it to DDR causes significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions towards reducing the inference cost of 1M context to be as cheap as 4K.
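The KV-cache bottleneck can be made concrete with back-of-the-envelope arithmetic. The layer and head counts below are illustrative of a 34B-class model with grouped-query attention, not figures taken from the paper:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Memory held by the KV cache: a key and a value vector per layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# A hypothetical 48-layer model with 8 KV heads of dim 128 at 50K context, fp16:
cache_gb = kv_cache_bytes(48, 8, 128, 50_000) / 1e9  # roughly 9.8 GB per request
```

At sizes like this, an 80 GB HBM device holds only a handful of concurrent requests alongside the model weights, which is exactly the concurrency restriction the abstract describes.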
https://arxiv.org/abs/2405.08944
Autonomous tuning of particle accelerators is an active and challenging field of research with the goal of enabling novel accelerator technologies for cutting-edge high-impact applications, such as physics discovery, cancer research and material sciences. A key challenge with autonomous accelerator tuning remains that the most capable algorithms require an expert in optimisation, machine learning or a similar field to implement the algorithm for every new tuning task. In this work, we propose the use of large language models (LLMs) to tune particle accelerators. We demonstrate on a proof-of-principle example the ability of LLMs to successfully and autonomously tune a particle accelerator subsystem based on nothing more than a natural language prompt from the operator, and compare the performance of our LLM-based solution to state-of-the-art optimisation algorithms, such as Bayesian optimisation (BO) and reinforcement learning-trained optimisation (RLO). In doing so, we also show how LLMs can perform numerical optimisation of a highly non-linear real-world objective function. Ultimately, this work represents yet another complex task that LLMs are capable of solving and promises to help accelerate the deployment of autonomous tuning algorithms to the day-to-day operations of particle accelerators.
https://arxiv.org/abs/2405.08888
We used a dictionary built from biomedical terminology extracted from various sources such as DrugBank, MedDRA, MedlinePlus, and TCMGeneDIT to tag more than 8 million Instagram posts by users who mentioned an epilepsy-relevant drug at least once between 2010 and early 2016. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false-positives. OpenAI's GPT series models were compared against human annotation. Frequent terms with a high false-positive rate were removed from the dictionary. Analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) used in Instagram posts, which were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary thus produced leads to a significantly different rank of important terms, as measured by their eigenvector-centrality in the knowledge networks. Furthermore, the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators in this task.
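The dictionary-refinement step can be sketched as estimating a per-term false-positive rate from the annotated sample and dropping terms above a cutoff. The cutoff and data shapes here are assumptions for illustration, not the paper's exact procedure:

```python
from collections import Counter

def false_positive_rates(annotations):
    """annotations: iterable of (term, is_false_positive) pairs from human review."""
    totals, fps = Counter(), Counter()
    for term, is_fp in annotations:
        totals[term] += 1
        fps[term] += int(is_fp)
    return {term: fps[term] / totals[term] for term in totals}

def refine_dictionary(terms, annotations, max_fp_rate=0.5):
    """Drop terms whose estimated false-positive rate exceeds the cutoff."""
    rates = false_positive_rates(annotations)
    return {t for t in terms if rates.get(t, 0.0) <= max_fp_rate}
```

The refined term set would then drive the knowledge-network construction and eigenvector-centrality comparison described above.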
https://arxiv.org/abs/2405.08784
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
https://arxiv.org/abs/2405.08751
The field of chemistry and Artificial Intelligence (AI) intersection is an area of active research that aims to accelerate scientific discovery. The integration of large language models (LLMs) with scientific modalities has shown significant promise in this endeavour. However, challenges persist in effectively addressing training efficacy and the out-of-distribution problem, particularly as existing approaches rely on larger models and datasets. In this context, we focus on machine language-molecule translation and deploy a novel training approach called contrastive preference optimisation, which avoids generating translations that are merely adequate but not perfect. To ensure generalisability and mitigate memorisation effects, we conduct experiments using only 10\% of the data. Our results demonstrate that our models achieve up to a 32\% improvement compared to counterpart models. We also introduce a scalable fine-grained evaluation methodology that accommodates responsibility.
https://arxiv.org/abs/2405.08619
Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have garnered significant attention due to their powerful and general capabilities in understanding, reasoning, and generation, thereby offering new paradigms for the integration of artificial intelligence with medicine. This survey comprehensively overviews the development background and principles of LLMs and MLLMs, as well as explores their application scenarios, challenges, and future directions in medicine. Specifically, this survey begins by focusing on the paradigm shift, tracing the evolution from traditional models to LLMs and MLLMs, summarizing the model structures to provide detailed foundational knowledge. Subsequently, the survey details the entire process from constructing and evaluating to using LLMs and MLLMs with a clear logic. Following this, to emphasize the significant value of LLMs and MLLMs in healthcare, we survey and summarize 6 promising applications in healthcare. Finally, the survey discusses the challenges faced by medical LLMs and MLLMs and proposes a feasible approach and direction for the subsequent integration of artificial intelligence with medicine. Thus, this survey aims to provide researchers with a valuable and comprehensive reference guide from the perspectives of the background, principles, and clinical applications of LLMs and MLLMs.
https://arxiv.org/abs/2405.08603
This article explores the adaptive relationship between Encoder Layers and Decoder Layers using the SOTA model Helsinki-NLP/opus-mt-de-en, which translates German to English. The specific method involves introducing a bias-free fully connected layer between the Encoder and Decoder, with different initializations of the layer's weights, and observing the outcomes of fine-tuning versus retraining. Four experiments were conducted in total. The results suggest that directly modifying the pre-trained model structure for fine-tuning yields suboptimal performance. However, upon observing the outcomes of the experiments with retraining, this structural adjustment shows significant potential.
https://arxiv.org/abs/2405.08570
Machine learning (ML)-based content moderation tools are essential to keep online spaces free from hateful communication. Yet, ML tools can only be as capable as the quality of the data they are trained on allows them to be. While there is increasing evidence that they underperform in detecting hateful communications directed towards specific identities and may discriminate against them, we know surprisingly little about the provenance of such bias. To fill this gap, we present a systematic review of the datasets for the automated detection of hateful communication introduced over the past decade, and unpack the quality of the datasets in terms of the identities that they embody: those of the targets of hateful communication that the data curators focused on, as well as those unintentionally included in the datasets. We find, overall, a skewed representation of selected target identities and mismatches between the targets that research conceptualizes and ultimately includes in datasets. Yet, by contextualizing these findings in the language and location of origin of the datasets, we highlight a positive trend towards the broadening and diversification of this research space.
https://arxiv.org/abs/2405.08562
Multi-Head Attention (MHA) is a key component of Transformer. In MHA, attention heads work independently, causing problems such as low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter and computation efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement of MHA in any transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at this https URL.
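A drastically simplified, hypothetical reading of the $\it{Compose}$ idea: derive a per-position head-mixing matrix from the input and use it to recombine the per-head score matrices. The actual DCMHA transformation is more elaborate; the shapes and projection below are illustrative only:

```python
import numpy as np

def compose(scores, query, w_proj):
    """Input-dependent mixing of attention heads.

    scores: (H, T, T) per-head attention scores
    query:  (T, D) token representations that drive the mixing
    w_proj: (D, H*H) projection yielding a per-position (H, H) mixing matrix
    """
    H, T, _ = scores.shape
    mix = (query @ w_proj).reshape(T, H, H)
    # new_scores[h, i, j] = sum_g mix[i, h, g] * scores[g, i, j]
    return np.einsum('ihg,gij->hij', mix, scores)
```

Because the mixing weights depend on the input, the effective attention pattern of each head can vary per token, which is one way to relieve the low-rank bottleneck and head redundancy the abstract mentions.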
https://arxiv.org/abs/2405.08553
Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions -- expressions with a common lexical core used by both speakers within a dialogue -- and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the amount of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
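Detecting shared lemmatised material can be sketched as collecting lemma n-grams per speaker and intersecting them. The real method targets constructions with a common lexical core; this bigram intersection is an illustrative simplification:

```python
def shared_ngrams(turns, n=2):
    """turns: iterable of (speaker, lemma_list). Return n-grams used by every speaker."""
    by_speaker = {}
    for speaker, lemmas in turns:
        grams = {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}
        by_speaker.setdefault(speaker, set()).update(grams)
    if len(by_speaker) < 2:
        return set()
    return set.intersection(*by_speaker.values())
```

Frequencies of such shared items per referent are the kind of feature the study relates to post-interaction labelling convergence.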
https://arxiv.org/abs/2405.08546
This paper aims to tackle the challenge posed by the increasing integration of software tools in research across various disciplines by investigating the application of Falcon-7b for the detection and classification of software mentions within scholarly texts. Specifically, the study focuses on solving Subtask I of the Software Mention Detection in Scholarly Publications (SOMD), which entails identifying and categorizing software mentions from academic literature. Through comprehensive experimentation, the paper explores different training strategies, including a dual-classifier approach, adaptive sampling, and weighted loss scaling, to enhance detection accuracy while overcoming the complexities of class imbalance and the nuanced syntax of scholarly writing. The findings highlight the benefits of selective labelling and adaptive sampling in improving the model's performance. However, they also indicate that integrating multiple strategies does not necessarily result in cumulative improvements. This research offers insights into the effective application of large language models for specific tasks such as SOMD, underlining the importance of tailored approaches to address the unique challenges presented by academic text analysis.
https://arxiv.org/abs/2405.08514
The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most Large Language Models (LLM) excelling in the legal realm are principally purposed for classification tasks, hence their reasoning rationale is subject to contention. The approach we advocate involves using a powerful teacher-LLM (ChatGPT) to extend the training dataset with explanations and generate synthetic data. The resulting data are then leveraged to fine-tune a small student-LLM. Contrary to previous work, our explanations are not directly derived from the teacher's internal knowledge. Instead they are grounded in authentic human analyses, therefore delivering a superior reasoning signal. Additionally, a new `mutation' method generates artificial data instances inspired by existing ones. We are publicly releasing the explanations as an extension to the original dataset, along with the synthetic dataset and the prompts that were used to generate both. Our system ranked 15th in the SemEval competition. It outperforms its own teacher and can produce explanations aligned with the original human analyses, as verified by legal experts.
https://arxiv.org/abs/2405.08502
Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, this depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books - NCSSB - datasets, which are created by substitution of synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for context-enriched models, but that quantity also plays a role in models without context inclusion strategies.
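The dataset-construction idea can be sketched as a substitution pass over book text. The compound list and synonym choice here are hypothetical, and the actual NCSSB pipeline involves more care with matching and inflection:

```python
def substitute_compounds(text, synonym_map):
    """Replace each listed noun compound with its first listed synonym."""
    for compound, synonyms in synonym_map.items():
        if synonyms:
            text = text.replace(compound, synonyms[0])
    return text
```

Because the substituted synonym keeps the original compound's idiomatic context, such sentences provide labelled material for training idiomaticity detectors.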
https://arxiv.org/abs/2405.08497
Machine translation (MT) models are known to suffer from gender bias, especially when translating into languages with extensive gendered morphology. Accordingly, they still fall short in using gender-inclusive language, also representative of non-binary identities. In this paper, we look at gender-inclusive neomorphemes, neologistic elements that avoid binary gender markings as an approach towards fairer MT. In this direction, we explore prompting techniques with large language models (LLMs) to translate from English into Italian using neomorphemes. So far, this area has been under-explored due to its novelty and the lack of publicly available evaluation resources. We fill this gap by releasing Neo-GATE, a resource designed to evaluate gender-inclusive en-it translation with neomorphemes. With Neo-GATE, we assess four LLMs of different families and sizes and different prompt formats, identifying strengths and weaknesses of each on this novel task for MT.
https://arxiv.org/abs/2405.08477
When studying political communication, combining the information from text, audio, and video signals promises to reflect the richness of human communication more comprehensively than confining it to individual modalities alone. However, when modeling such multimodal data, its heterogeneity, connectedness, and interaction are challenging to address. We argue that aligning the respective modalities can be an essential step in fully exploiting the potential of multimodal data because it informs the model with human understanding. Exploring aligned modalities unlocks promising analytical leverage. First, it allows us to make the most of the information in the data, which inter alia opens the door to better quality predictions. Second, it is possible to answer research questions that span multiple modalities with cross-modal queries. Finally, alignment addresses concerns about model interpretability. We illustrate the utility of this approach by analyzing how German MPs address members of the far-right AfD in their speeches, and by predicting the tone of video advertising in the context of the 2020 US presidential race. Our paper offers important insights to all keen to analyze multimodal data effectively.
https://arxiv.org/abs/2405.08454
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
https://arxiv.org/abs/2405.08402
The rapid advancement of large language models (LLMs) has made it increasingly difficult to distinguish between text written by humans and machines. Addressing this, we propose a novel method for generating watermarks that strategically alters token probabilities during generation. Unlike previous works, this method uniquely employs linguistic features such as stylometry. Concretely, we introduce acrostica and sensorimotor norms to LLMs. Further, these features are parameterized by a key, which is updated every sentence. To compute this key, we use semantic zero shot classification, which enhances resilience. In our evaluation, we find that for three or more sentences, our method achieves a false positive and false negative rate of 0.02. For the case of a cyclic translation attack, we observe similar results for seven or more sentences. This research is of particular interest for proprietary LLMs to facilitate accountability and prevent societal harm.
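As a generic illustration of keyed token-probability biasing (not this paper's stylometric features, which rely on acrostica and sensorimotor norms; the vocabulary size, key derivation, and bias strength below are assumptions):

```python
import hashlib
import random

def green_list(key: str, vocab_size: int, fraction: float = 0.5):
    """Derive a pseudo-random 'green' subset of the vocabulary from a sentence-level key."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], 'big')
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(vocab_size * fraction)])

def bias_logits(logits, key, delta=2.0):
    """Nudge generation towards green tokens; a detector holding the key can test for the bias."""
    greens = green_list(key, len(logits))
    return [l + delta if i in greens else l for i, l in enumerate(logits)]
```

A detector that can recompute the key per sentence counts how often generated tokens fall in the green set and flags text whose green rate is statistically improbable for unwatermarked writing.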
https://arxiv.org/abs/2405.08400