Linguistic ambiguity remains a significant challenge for natural language processing (NLP) systems, notwithstanding advances in architectures such as Transformers and BERT. Inspired by the recent success of instructional models like ChatGPT and Gemini (the latter known as Bard in 2023), this study analyzes and discusses linguistic ambiguity in these models, focusing on three types prevalent in Brazilian Portuguese: semantic, syntactic, and lexical ambiguity. We created a corpus of 120 sentences, both ambiguous and unambiguous, for classification, explanation, and disambiguation. The models' capability to generate ambiguous sentences was also explored by soliciting sets of sentences for each type of ambiguity. The results underwent qualitative analysis, drawing on recognized linguistic references, and quantitative assessment based on the accuracy of the responses obtained. Even the most sophisticated models, such as ChatGPT and Gemini, exhibited errors and deficiencies in their responses, and their explanations were often inconsistent. Furthermore, accuracy peaked at 49.58 percent, indicating the need for descriptive studies to support supervised learning.
https://arxiv.org/abs/2404.16653
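A minimal sketch of how such a classification probe could be scored, assuming a hypothetical `ask_model` wrapper around a chat-model API (the authors' actual prompts and evaluation protocol are not reproduced here):

```python
# Illustrative sketch (not the authors' exact protocol): scoring a model's
# ambiguity classifications against gold labels for a small sentence corpus.
PROMPT = (
    "Classifique a frase a seguir como 'semântica', 'sintática', "
    "'lexical' ou 'sem ambiguidade', e explique sua resposta.\n\nFrase: {sentence}"
)

def ask_model(prompt: str) -> str:
    """Hypothetical wrapper around a ChatGPT/Gemini-style API; returns the reply text."""
    raise NotImplementedError

def accuracy(corpus: list[tuple[str, str]]) -> float:
    """corpus: (sentence, gold_label) pairs; returns the fraction answered correctly."""
    hits = 0
    for sentence, gold in corpus:
        answer = ask_model(PROMPT.format(sentence=sentence)).lower()
        hits += gold in answer  # lenient match: gold label mentioned anywhere in the reply
    return hits / len(corpus)
```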
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have sought to improve performance on such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited in that they incorporate only syntactic or only lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we explore the potential of this approach. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntax-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework consistently outperforms all baselines for zero-shot cross-lingual transfer, with gains of 1.0-3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
https://arxiv.org/abs/2404.16627
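The code-switching component can be illustrated with a toy substitution pass; the bilingual dictionary and substitution rate below are illustrative placeholders, not the paper's actual resources:

```python
# Minimal sketch of code-switching augmentation: randomly substitute
# source-language words with bilingual-dictionary translations so the model
# implicitly observes lexical alignments during training.
import random

BILINGUAL_DICT = {"dog": "perro", "house": "casa", "eats": "come"}  # toy en->es entries

def code_switch(tokens: list[str], rate: float = 0.3, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    return [
        BILINGUAL_DICT[t] if t in BILINGUAL_DICT and rng.random() < rate else t
        for t in tokens
    ]

print(code_switch("the dog eats in the house".split(), rate=1.0))
# ['the', 'perro', 'come', 'in', 'the', 'casa']
```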
It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
https://arxiv.org/abs/2404.16574
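A minimal sketch of the probing recipe, assuming it reduces to PCA over ALBERT's static input-embedding matrix (the paper's exact token set and model sizes are not reproduced here):

```python
# Pull ALBERT's input embeddings for numerals and number words, then inspect
# the leading PCA axes for an approximate numerical ordering.
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2")
emb = model.get_input_embeddings().weight.detach().numpy()

words = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
items = [str(i) for i in range(1, 10)] + words

vecs, kept = [], []
for item in items:
    ids = tok.encode(item, add_special_tokens=False)
    if len(ids) == 1:  # keep only single-piece tokens for a clean probe
        vecs.append(emb[ids[0]])
        kept.append(item)

coords = PCA(n_components=2).fit_transform(np.stack(vecs))
for item, (x, y) in zip(kept, coords):
    print(f"{item:>6}: ({x:+.3f}, {y:+.3f})")
```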
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground-truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing variations in their responses. To measure reasoning capabilities, the generated explanations are compared to the ground-truth explanations using metrics such as BERT, BLEU, and F1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general-knowledge questions, none of the models achieved a score of 90% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.
https://arxiv.org/abs/2404.16478
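The consistency measure can be sketched as repeated querying with a majority-agreement score; `query_llm` below is a hypothetical stand-in for whichever public or proprietary model is under test:

```python
# Re-ask the same BoolQ question several times and measure agreement.
from collections import Counter

def query_llm(question: str, passage: str) -> str:
    """Hypothetical API wrapper; returns 'yes' or 'no'."""
    raise NotImplementedError

def consistency(question: str, passage: str, n_trials: int = 5) -> float:
    """Fraction of trials agreeing with the modal answer (1.0 = fully consistent)."""
    answers = [query_llm(question, passage) for _ in range(n_trials)]
    return Counter(answers).most_common(1)[0][1] / n_trials
```

Reasoning would then be scored separately, comparing generated explanations to the gold explanations with BERTScore, BLEU, and F1, as the abstract describes.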
We present a novel approach to detecting noun abstraction within a large language model (LLM). Starting from a psychologically motivated set of noun pairs in taxonomic relationships, we instantiate surface patterns indicating hypernymy and analyze the attention matrices produced by BERT. We compare the results to two sets of counterfactuals and show that we can detect hypernymy in the abstraction mechanism, which cannot solely be related to the distributional similarity of noun pairs. Our findings are a first step towards the explainability of conceptual abstraction in LLMs.
https://arxiv.org/abs/2404.15848
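A minimal sketch of reading attention weights between a hyponym and its hypernym in a Hearst-style pattern, using a plain BERT checkpoint (the paper's counterfactual comparisons are omitted):

```python
# Extract per-layer attention from the hyponym token to the hypernym token.
# Assumes both nouns are single wordpieces in the vocabulary.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "a robin is a bird ."
enc = tok(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions  # tuple of (batch, heads, seq, seq), one per layer

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
i, j = tokens.index("robin"), tokens.index("bird")
for layer, att in enumerate(attentions):
    weight = att[0, :, i, j]  # attention from 'robin' to 'bird', over all heads
    print(f"layer {layer:2d}: max head weight {weight.max().item():.3f}")
```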
Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and ethical deployment. This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare, emphasizing the critical need for empirical validation to fully exploit their capabilities in enhancing healthcare outcomes. Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by exploring the roles of LLMs in different medical applications, detailing how they are evaluated based on their performance in tasks such as clinical application, medical text data processing, information retrieval, data analysis, medical scientific writing, and educational content generation. The subsequent sections delve into the methodologies employed in these evaluations, discussing the benchmarks and metrics used to assess the models' effectiveness, accuracy, and ethical alignment. Through this survey, we aim to equip healthcare professionals, researchers, and policymakers with a comprehensive understanding of the potential strengths and limitations of LLMs in medical applications. By providing detailed insights into the evaluation processes and the challenges faced in integrating LLMs into healthcare, this survey seeks to guide the responsible development and deployment of these powerful models, ensuring they are harnessed to their full potential while maintaining stringent ethical standards.
https://arxiv.org/abs/2404.15777
Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluation remains an open problem. Prior work proposed a token-overlap based metric, Distinctiveness Score, to measure contrast, but it does not account for sensitivity to meaning-preserving lexical variations. In this work, we propose an automated evaluation metric, CASPR, to better measure contrast between a pair of summaries. Our metric is based on a simple and lightweight method that leverages the natural language inference (NLI) task to measure contrast by segmenting reviews into single-claim sentences and carefully aggregating NLI scores between them into a summary-level score. We compare CASPR with Distinctiveness Score and a simple yet powerful baseline based on BERTScore. Our results on the prior dataset CoCoTRIP demonstrate that CASPR can more reliably capture the contrastiveness of summary pairs compared to the baselines.
https://arxiv.org/abs/2404.15565
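A CASPR-style score can be sketched with an off-the-shelf NLI model; the aggregation below (mean of per-claim maxima over contradiction probabilities) is our simplification, not necessarily the paper's exact formula:

```python
# Score all cross-summary sentence pairs with an NLI model and aggregate
# the contradiction probability into a summary-level contrast score.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
CONTRADICTION = 0  # label order for roberta-large-mnli: contradiction, neutral, entailment

def contrast_score(summary_a: list[str], summary_b: list[str]) -> float:
    scores = []
    for sent_a in summary_a:
        probs = []
        for sent_b in summary_b:
            enc = tok(sent_a, sent_b, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = nli(**enc).logits
            probs.append(torch.softmax(logits, dim=-1)[0, CONTRADICTION].item())
        scores.append(max(probs))  # strongest contradiction found for this claim
    return sum(scores) / len(scores)
```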
Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. Here we focus on how fairness issues impact automatically generated test content, which can have stringent requirements to ensure the test measures only what it was intended to measure. Specifically, we identify test content that is focused on particular domains and experiences reflecting only a certain demographic, or that is potentially emotionally upsetting; both could inadvertently impact a test-taker's score. This kind of content does not reflect typical biases out of context, making it challenging even for modern models that contain safeguards. We build a dataset of 621 generated texts annotated for fairness and explore a variety of classification methods: fine-tuning, topic-based classification, and prompting, including few-shot and self-correcting prompts. We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of 0.791 on our held-out test set, while much smaller BERT- and topic-based models have competitive performance on out-of-domain data.
https://arxiv.org/abs/2404.15104
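A sketch of the self-correcting prompt pattern, with a hypothetical `complete` API wrapper and the few-shot examples omitted:

```python
# Two-pass fairness screening: a first pass labels the item, a second pass
# asks the model to re-examine and, if needed, correct its own judgment.
def complete(prompt: str) -> str:
    """Hypothetical text-completion API wrapper."""
    raise NotImplementedError

def classify_with_self_correction(item: str) -> str:
    first = complete(
        f"Does the following test content raise fairness concerns "
        f"(narrow demographic experience or emotionally upsetting topics)? "
        f"Answer 'fair' or 'unfair'.\n\n{item}"
    )
    second = complete(
        f"Content: {item}\nInitial judgment: {first}\n"
        f"Re-examine the judgment above. If it is wrong, correct it; "
        f"otherwise repeat it. Answer 'fair' or 'unfair'."
    )
    return second.strip().lower()
```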
This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
https://arxiv.org/abs/2404.15003
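The suggested ensemble could be as simple as majority voting over the three systems, as in this sketch with hypothetical lemmatizer callables:

```python
# Majority-vote ensemble over generative, pattern-based, and rule-based
# lemmatizers, with the rule-based output as the tie-breaker.
from collections import Counter
from typing import Callable

def ensemble_lemma(
    word: str,
    generative: Callable[[str], str],
    pattern_based: Callable[[str], str],
    rule_based: Callable[[str], str],
) -> str:
    votes = [generative(word), pattern_based(word), rule_based(word)]
    lemma, count = Counter(votes).most_common(1)[0]
    return lemma if count >= 2 else rule_based(word)  # no majority: trust the rules
```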
The PLAID (Performance-optimized Late Interaction Driver) algorithm for ColBERTv2 uses clustered term representations to retrieve and progressively prune documents for final (exact) document scoring. In this paper, we reproduce and fill in missing gaps from the original work. By studying the parameters PLAID introduces, we find that its Pareto frontier is formed of a careful balance among its three parameters; deviations beyond the suggested settings can substantially increase latency without necessarily improving its effectiveness. We then compare PLAID with an important baseline missing from the paper: re-ranking a lexical system. We find that applying ColBERTv2 as a re-ranker atop an initial pool of BM25 results provides better efficiency-effectiveness trade-offs in low-latency settings. However, re-ranking cannot reach peak effectiveness at higher latency settings due to limitations in recall of lexical matching and provides a poor approximation of an exhaustive ColBERTv2 search. We find that recently proposed modifications to re-ranking that pull in the neighbors of top-scoring documents overcome this limitation, providing a Pareto frontier across all operational points for ColBERTv2 when evaluated using a well-annotated dataset. Curious about why re-ranking methods are highly competitive with PLAID, we analyze the token representation clusters PLAID uses for retrieval and find that most clusters are predominantly aligned with a single token and vice versa. Given the competitive trade-offs that re-ranking baselines exhibit, this work highlights the importance of carefully selecting pertinent baselines when evaluating the efficiency of retrieval engines.
https://arxiv.org/abs/2404.14989
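The re-ranking baseline can be sketched as a two-stage pipeline; `colbert_score` below is a hypothetical stand-in for exact ColBERTv2 scoring over the BM25 pool:

```python
# BM25 retrieves a candidate pool, a late-interaction scorer re-orders it.
from rank_bm25 import BM25Okapi

def colbert_score(query: str, doc: str) -> float:
    """Hypothetical exact ColBERTv2 query-document scorer."""
    raise NotImplementedError

def retrieve_then_rerank(query: str, corpus: list[str], pool_size: int = 1000) -> list[str]:
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    pool = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:pool_size]
    reranked = sorted(pool, key=lambda i: colbert_score(query, corpus[i]), reverse=True)
    return [corpus[i] for i in reranked]
```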
This paper focuses on a very important societal challenge of water quality analysis. Being one of the key factors in the economic and social development of society, the provision of water and ensuring its quality has always remained one of the top priorities of public authorities. To ensure the quality of water, different methods for monitoring and assessing the water networks, such as offline and online surveys, are used. However, these surveys have several limitations, such as the limited number of participants and low frequency due to the labor involved in conducting such surveys. In this paper, we propose a Natural Language Processing (NLP) framework to automatically collect and analyze water-related posts from social media for data-driven decisions. The proposed framework is composed of two components, namely (i) text classification, and (ii) topic modeling. For text classification, we propose a merit-fusion-based framework incorporating several Large Language Models (LLMs) where different weight selection and optimization methods are employed to assign weights to the LLMs. In topic modeling, we employed the BERTopic library to discover the hidden topic patterns in the water-related tweets. We also analyzed relevant tweets originating from different regions and countries to explore global, regional, and country-specific issues and water-related concerns. We also collected and manually annotated a large-scale dataset, which is expected to facilitate future research on the topic.
https://arxiv.org/abs/2404.14977
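The topic-modeling component uses the real BERTopic API, sketched below together with a toy weighted-vote fusion over per-LLM class probabilities (the paper optimizes the weights rather than fixing them):

```python
# Topic discovery with BERTopic, plus an illustrative weighted fusion of
# classifier probabilities from several LLMs.
import numpy as np
from bertopic import BERTopic

def discover_topics(tweets: list[str]):
    topic_model = BERTopic(language="english")
    topics, _ = topic_model.fit_transform(tweets)
    return topic_model.get_topic_info(), topics

def fused_prediction(llm_probs: list[np.ndarray], weights: list[float]) -> int:
    """llm_probs: one (n_classes,) probability vector per LLM."""
    combined = sum(w * p for w, p in zip(weights, llm_probs))
    return int(np.argmax(combined))
```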
This research addresses the challenge of estimating bathymetry from imaging sonars where the state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme and model the sonar measurements with a differentiable renderer using sonar volumetric rendering employed with hierarchical sampling techniques. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
https://arxiv.org/abs/2404.14819
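A very loose sketch of the seafloor representation alone, as a plain MLP heightmap; the paper's multi-resolution hash encoding, differentiable sonar renderer, and beam-pattern estimation are not reproduced here:

```python
# Neural heightmap: an MLP mapping horizontal coordinates to seafloor height.
import torch
import torch.nn as nn

class NeuralHeightmap(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        """xy: (N, 2) horizontal positions -> (N,) seafloor heights."""
        return self.net(xy).squeeze(-1)

heights = NeuralHeightmap()(torch.rand(8, 2))  # query eight sample positions
```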
Quantitative and numerical comprehension in language is important in many fields like education and finance, but remains a challenging task for language models. While tool and calculator usage has been shown to help improve mathematical reasoning in large pretrained decoder-only language models, this remains unexplored for smaller language models with encoders. In this paper, we propose Pre-Calc, a simple pre-finetuning objective of learning to use the calculator for both encoder-only and encoder-decoder architectures, formulated as a discriminative and a generative task, respectively. We pre-train BERT and RoBERTa for discriminative calculator use and Flan-T5 for generative calculator use on the MAWPS, SVAMP, and AsDiv-A datasets, which improves performance on downstream tasks that require numerical understanding. Our code and data are available at this https URL.
https://arxiv.org/abs/2404.14355
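The discriminative (encoder-only) formulation might be set up as sequence classification over arithmetic operations, as in this sketch; the label set and checkpoint are illustrative, and the real setup is in the paper's repository:

```python
# Treat calculator use as sequence classification over operations.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

OPS = ["add", "subtract", "multiply", "divide"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(OPS)
)

enc = tok("Mary had 5 apples and ate 2. How many are left?", return_tensors="pt")
op = OPS[model(**enc).logits.argmax(-1).item()]  # head is untrained here; fine-tune on MAWPS/SVAMP first
```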
Stance detection has been widely studied as the task of determining whether a social media post is positive, negative, or neutral towards a specific issue, such as support for vaccines. Research in stance detection has, however, often been limited to a single language and, where more than one language has been studied, has focused on few-shot settings, overlooking the challenges of developing a zero-shot cross-lingual stance detection model. This paper makes the first such effort by introducing a novel approach to zero-shot cross-lingual stance detection, Multilingual Translation-Augmented BERT (MTAB), which aims to enhance the performance of a cross-lingual classifier in the absence of explicit training data for target languages. Our technique employs translation augmentation to improve zero-shot performance and pairs it with adversarial learning to further boost model efficacy. Through experiments on datasets labeled for stance towards vaccines in four languages (English, German, French, and Italian), we demonstrate the effectiveness of our proposed approach, showcasing improved results in comparison to a strong baseline model as well as ablated versions of our model. Our experiments demonstrate the contribution of the model's components, not least the translation-augmented data and the adversarial learning component, to its improved performance. We have made our source code accessible on GitHub.
https://arxiv.org/abs/2404.14339
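The adversarial component is commonly implemented with a gradient reversal layer feeding a language discriminator; the sketch below shows that standard construction, which may differ from the paper's exact architecture:

```python
# Gradient reversal layer: the discriminator learns to predict the language,
# while reversed gradients push the encoder toward language-invariant features.
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # flip gradients flowing to the encoder

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage: lang_logits = discriminator(grad_reverse(encoder_output))
```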
In this paper, we introduce "Marking", a novel grading task that enhances automated grading systems by performing an in-depth analysis of student responses and providing students with visual highlights. Unlike traditional systems that provide binary scores, "marking" identifies and categorizes segments of the student response as correct, incorrect, or irrelevant and detects omissions from gold answers. We introduce a new dataset meticulously curated by Subject Matter Experts specifically for this task. We frame "Marking" as an extension of the Natural Language Inference (NLI) task, which is extensively explored in the field of Natural Language Processing. The gold answer and the student response play the roles of premise and hypothesis in NLI, respectively. We subsequently train language models to identify entailment, contradiction, and neutrality from student response, akin to NLI, and with the added dimension of identifying omissions from gold answers. Our experimental setup involves the use of transformer models, specifically BERT and RoBERTa, and an intelligent training step using the e-SNLI dataset. We present extensive baseline results highlighting the complexity of the "Marking" task, which sets a clear trajectory for the upcoming study. Our work not only opens up new avenues for research in AI-powered educational assessment tools, but also provides a valuable benchmark for the AI in education community to engage with and improve upon in the future. The code and dataset can be found at this https URL.
https://arxiv.org/abs/2404.14301
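A sketch of the NLI framing, mapping NLI labels onto marking categories per response segment; the model choice and label mapping are our illustrative reading of the setup:

```python
# Judge each student-response segment against the gold answer (premise) and
# map entailment/contradiction/neutral onto correct/incorrect/irrelevant.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
MARK = {0: "incorrect", 1: "irrelevant", 2: "correct"}  # contradiction/neutral/entailment

def mark_segment(gold_answer: str, segment: str) -> str:
    enc = tok(gold_answer, segment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        label = nli(**enc).logits.argmax(-1).item()
    return MARK[label]
```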
This paper investigates what insights about linguistic features and what knowledge about the structure of natural language can be obtained from the encodings in transformer language models. In particular, we explore how BERT encodes the government relation between constituents in a sentence. We use several probing classifiers and data from two morphologically rich languages. Our experiments show that information about government is encoded across all transformer layers, but predominantly in the early layers of the model. We find that, for both languages, a small number of attention heads encode enough information about the government relations to enable us to train a classifier capable of discovering new, previously unknown types of government, never seen in the training data. Currently, data is lacking for the research community working on grammatical constructions, and government in particular. We release the Government Bank -- a dataset defining the government relations for thousands of lemmas in the languages in our experiments.
https://arxiv.org/abs/2404.14270
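A probing classifier of this kind can be sketched as a linear model over concatenated contextual vectors of a candidate head-dependent pair; indices below are wordpiece positions, and pair extraction and labels would come from annotated data:

```python
# Layer-wise probe: concatenate contextual vectors of a token pair and train
# a linear classifier to predict whether the pair stands in a government relation.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

def pair_features(sentence: str, i: int, j: int, layer: int) -> np.ndarray:
    """Concatenated layer-`layer` vectors for wordpieces i (head) and j (dependent)."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, dim)
    return torch.cat([hidden[i], hidden[j]]).numpy()

# X: stacked pair features, y: 1 if the pair is a government relation, else 0
# probe = LogisticRegression(max_iter=1000).fit(X, y)
```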
Biomedical literature is a rapidly expanding field of science and technology. Classification of biomedical texts is an essential part of biomedical research, especially in the field of biology. This work proposes a fine-tuned DistilBERT, a methodology-specific, pre-trained classification language model for mining biomedical texts. DistilBERT has proven its effectiveness in linguistic understanding tasks while reducing the size of BERT models by 40% and being 60% faster. The main objective of this project is to improve the model and assess its performance compared to the non-fine-tuned model. We used DistilBERT as the base model and pre-trained it on a corpus of 32,000 abstracts and full-text articles; our results were impressive, surpassing those of traditional literature classification methods based on RNNs or LSTMs. Our aim is to integrate this highly specialized and specific model into different research industries.
https://arxiv.org/abs/2404.13779
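A standard fine-tuning sketch for the described classification setup; the checkpoint, label count, and hyperparameters are illustrative:

```python
# Fine-tune DistilBERT for text classification with the Hugging Face Trainer.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5  # e.g., five biomedical categories
)

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)

# train_ds / eval_ds: datasets with 'text' and 'label' columns, mapped with `tokenize`
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert-biomed", num_train_epochs=3,
                           per_device_train_batch_size=16),
    # train_dataset=train_ds, eval_dataset=eval_ds,
)
# trainer.train()
```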
Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for speech emotion recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate HuBERT features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve improvements of 2.6% in weighted accuracy and 1.87% in unweighted accuracy. Extensive experiments demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2404.13509
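A sketch of the raw-audio branch alone: HuBERT frame features collapsed by a simple learned attention pool (the MF and HCA modules are not reproduced here):

```python
# Extract HuBERT features from a waveform and pool them into one utterance vector.
import torch
import torch.nn as nn
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

class AttentionPool(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
        return (weights * frames).sum(dim=1)                # (B, dim)

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio (dummy)
with torch.no_grad():
    frames = hubert(waveform).last_hidden_state
utterance_vec = AttentionPool()(frames)  # feed to an emotion classification head
```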
The vast majority of the popular English named entity recognition (NER) datasets contain American or British English data, despite the existence of many global varieties of English. As such, it is unclear whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset, the Worldwide English NER Dataset, to analyze NER model performance on low-resource English variants from around the world. We test widely used NER toolkits and transformer models, including models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: a commonly used British English newswire dataset, CoNLL 2003, a more American focused dataset OntoNotes, and our global dataset. All models trained on the CoNLL or OntoNotes datasets experienced significant performance drops (over 10 F1 in some cases) when tested on the Worldwide English dataset. Upon examination of region-specific errors, we observe the greatest performance drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance. Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or OntoNotes lost only 1-2 F1 on both test sets.
https://arxiv.org/abs/2404.13465
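Per-region scoring can be sketched with seqeval over BIO tag sequences grouped by document origin; data loading is assumed to produce parallel lists:

```python
# Compute entity-level F1 per region from gold and predicted BIO tag sequences.
from collections import defaultdict
from seqeval.metrics import f1_score

def f1_by_region(gold, pred, regions):
    """gold/pred: lists of BIO tag sequences; regions: region name per sequence."""
    buckets = defaultdict(lambda: ([], []))
    for g, p, r in zip(gold, pred, regions):
        buckets[r][0].append(g)
        buckets[r][1].append(p)
    return {r: f1_score(g, p) for r, (g, p) in buckets.items()}
```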
The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.
https://arxiv.org/abs/2404.13292
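A toy version of the intrinsic morphological-vs-alien check: a tokenization counts as morphological if its segment boundaries fall within the word's morpheme boundaries; the morpheme entry below is illustrative of what UniMorph-style data would supply:

```python
# Label a tokenizer's segmentation of a word as 'morphological' or 'alien'
# by comparing subword boundaries against gold morpheme boundaries.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
MORPHEMES = {"unhappiness": ["un", "happi", "ness"]}  # toy gold segmentation

def boundaries(segments: list[str]) -> set[int]:
    out, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        out.add(pos)
    return out

def label(word: str) -> str:
    pieces = [p.lstrip("#") for p in tok.tokenize(word)]  # strip wordpiece '##' markers
    if word not in MORPHEMES or len(pieces) == 1:
        return "morphological"  # whole-word tokens count as morphological here
    aligned = boundaries(pieces) <= boundaries(MORPHEMES[word])
    return "morphological" if aligned else "alien"

print(label("unhappiness"))
```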