The objective of this study is to address the critical issue of de-identification of clinical reports, allowing access to data for research purposes while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system merging the results of a deep learning model with manual rules. Our system achieves an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
https://arxiv.org/abs/2303.13451
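To illustrate the hybrid design described above, here is a minimal sketch of how rule-based and model-predicted spans might be merged before replacement. The entity format, the regex, and the rules-win-on-overlap policy are our assumptions, not the released implementation.

```python
import re

# Hypothetical entity format: (start, end, label) character spans.
PHONE_RE = re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b")  # French phone numbers

def rule_entities(text):
    """Rule-based matches for highly regular identifiers."""
    return [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]

def merge(model_ents, rule_ents):
    """Union of model and rule spans; rules win on overlap."""
    merged = list(rule_ents)
    for ent in model_ents:
        if not any(ent[0] < r[1] and r[0] < ent[1] for r in rule_ents):
            merged.append(ent)
    return sorted(merged)

def pseudonymize(text, entities):
    """Replace each identifying span with a label placeholder,
    working right-to-left so earlier offsets stay valid."""
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text
```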
To detect the deployment of large language models for malicious use cases (e.g., fake content creation or academic plagiarism), several approaches have recently been proposed for identifying AI-generated text via watermarks or statistical irregularities. How robust are these detection algorithms to paraphrases of AI-generated text? To stress test these detectors, we first train an 11B parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, optionally leveraging surrounding text (e.g., user-written prompts) as context. DIPPER also uses scalar knobs to control the amount of lexical diversity and reordering in the paraphrases. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings, while only classifying 1% of human-written sequences as AI-generated. We will open source our code, model and data for future research.
https://arxiv.org/abs/2303.13408
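The retrieval defense can be sketched in a few lines. This is a minimal illustration assuming a generic embedding encoder and cosine similarity against a fixed threshold; the paper's actual retriever, index, and threshold may differ.

```python
import numpy as np

def embed(texts):
    """Stand-in encoder: replace with a real semantic embedding model."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 256))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

class RetrievalDetector:
    """Flags a candidate as AI-generated if it is semantically close
    to any sequence the API previously generated. Paraphrasing changes
    surface form but tends to preserve this semantic match."""

    def __init__(self, api_generations, threshold=0.75):
        self.db = embed(api_generations)   # one row per stored generation
        self.threshold = threshold

    def is_ai_generated(self, candidate):
        query = embed([candidate])[0]
        similarity = self.db @ query       # cosine (rows are unit norm)
        return float(similarity.max()) >= self.threshold
```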
Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation, which enables data augmentation for self-finetuning, and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
https://arxiv.org/abs/2303.13386
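To make the NLGU setup concrete, the sketch below shows one way the two directions could be serialized as text-to-text pairs for a T5-style model. The task prefixes and example texts are hypothetical, not the paper's exact format.

```python
def nlu_example(premise, hypothesis, label):
    """NLU direction: predict the label from the text pair."""
    return {
        "input": f"nli premise: {premise} hypothesis: {hypothesis}",
        "target": label,
    }

def nlg_example(premise, label, hypothesis):
    """NLG direction: generate in-domain data conditioned on a label,
    enabling augmentation for self-finetuning."""
    return {
        "input": f"generate {label}. premise: {premise}",
        "target": hypothesis,
    }

# Both directions are mixed into one multi-task training batch.
batch = [
    nlu_example("No pleural effusion.", "The lungs are clear.", "entailment"),
    nlg_example("No pleural effusion.", "contradiction",
                "There is a large effusion."),
]
```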
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an online reference for bibliographic information on major computer science publications that indexes over 4.4 million publications by more than 2.2 million authors. Our dataset consists of 10,000 question-answer pairs with the corresponding SPARQL queries, which can be executed over the DBLP KG to fetch the correct answer. DBLP-QuAD is the largest scholarly question answering dataset.
https://arxiv.org/abs/2303.13351
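For readers unfamiliar with the setup, the snippet below runs a SPARQL query of the kind paired with each question. The endpoint URL, predicates, and query are illustrative assumptions, not taken from the dataset.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL is an assumption; substitute the official DBLP endpoint.
sparql = SPARQLWrapper("https://sparql.dblp.org/sparql")
sparql.setReturnFormat(JSON)

# A dataset record pairs a natural-language question ("How many papers
# did Judea Pearl author?") with a SPARQL query like this hypothetical one.
sparql.setQuery("""
    SELECT (COUNT(?paper) AS ?count) WHERE {
        ?paper <https://dblp.org/rdf/schema#authoredBy> ?author .
        ?author <http://www.w3.org/2000/01/rdf-schema#label> "Judea Pearl" .
    }
""")
results = sparql.query().convert()
print(results["results"]["bindings"][0]["count"]["value"])
```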
We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at this https URL.
https://arxiv.org/abs/2303.13310
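A hedged usage sketch: since SwissBERT builds on an adapter-based (X-MOD-style) architecture, loading it and selecting a language adapter might look like the following. The model identifier, language code, and set_default_language call are assumptions based on the public release; verify against the released model card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model ID and adapter API assumed from the public release.
tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModel.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # route through the German adapter

inputs = tokenizer("Wir präsentieren SwissBERT.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # contextual embeddings from the selected adapter path
```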
In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.
https://arxiv.org/abs/2303.13284
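The grounding step can be illustrated as follows: a minimal sketch assuming a label-index lookup returns candidate IDs, and that the generated truncated embedding is compared against the leading dimensions of the true KG embeddings. All names here are hypothetical.

```python
import numpy as np

def ground_entity(predicted_trunc, candidates, kg_embeddings):
    """Map a generated entity label to a KG ID.

    predicted_trunc: truncated embedding vector the model generated.
    candidates: candidate KG IDs from a label lookup (assumed given).
    kg_embeddings: {kg_id: full KG embedding vector}.
    """
    n = len(predicted_trunc)
    best_id, best_dist = None, np.inf
    for kg_id in candidates:
        # Compare against the same leading dimensions of the true embedding;
        # this is the finer disambiguation signal the truncation enables.
        dist = np.linalg.norm(kg_embeddings[kg_id][:n] - predicted_trunc)
        if dist < best_dist:
            best_id, best_dist = kg_id, dist
    return best_id
```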
Parameter-efficient transfer learning with adapters has been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottleneck layers added between transformer layers while keeping the large pretrained language models (PLMs) frozen. Despite showing promising results in NLP, these methods are under-explored in Information Retrieval. While previous studies have only experimented with dense retrievers or in cross-lingual retrieval scenarios, in this paper we aim to complete the picture on the use of adapters in IR. First, we study adapters for SPLADE, a sparse retriever, for which adapters not only retain the efficiency and effectiveness otherwise achieved by fine-tuning, but are memory-efficient and orders of magnitude lighter to train. We observe that Adapters-SPLADE optimizes just 2\% of training parameters, yet outperforms its fully fine-tuned counterpart and existing parameter-efficient dense IR models on IR benchmark datasets. Secondly, we address domain adaptation of neural retrieval with adapters on cross-domain BEIR datasets and TripClick. Finally, we also consider knowledge sharing between rerankers and first-stage rankers. Overall, our study completes the examination of adapters for neural IR.
https://arxiv.org/abs/2303.13220
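For context, a bottleneck adapter is only a few lines of PyTorch. The sketch below shows the generic down-project/up-project residual block and a helper that freezes everything else; the dimensions are illustrative, not the paper's configuration.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Inserted between transformer layers while the PLM stays frozen."""

    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def mark_trainable(model):
    """Freeze the PLM and train only adapter parameters
    (on the order of 2% of the total)."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```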
Pretrained language models (PLMs) have shown remarkable improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters and completely ignore word information. Although Whole Word Masking can alleviate this, the semantics of words are still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) that considers both characters and words. To achieve this, we design objective functions for learning both character- and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new SOTA performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works with Japanese. Our code has been released here~\footnote{\url{this https URL}} and you can download our model here~\footnote{\url{this https URL}}.
https://arxiv.org/abs/2303.13065
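A toy sketch of the granularity contrast, assuming pre-segmented words: whole words are selected for masking while the model still consumes a character sequence. The real MigBERT objectives are richer than this; the snippet only illustrates the idea.

```python
import random

def mixed_granularity_mask(words, mask_token="[MASK]", p_word=0.15):
    """Mask at word granularity (word-level objective) while keeping the
    character sequence as model input (character-level objective)."""
    chars, char_labels = [], []
    for word in words:
        masked = random.random() < p_word
        for ch in word:
            chars.append(mask_token if masked else ch)
            char_labels.append(ch if masked else "-")  # "-" = not predicted
    return chars, char_labels

tokens, labels = mixed_granularity_mask(["自然", "语言", "处理"])
```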
The emergence of ChatGPT has recently garnered significant attention from the computational linguistics community. To demonstrate its capabilities as a keyphrase generator, we conduct a preliminary evaluation of ChatGPT for the keyphrase generation task. We evaluate its performance in various aspects, including keyphrase generation prompts, keyphrase generation diversity, multi-domain keyphrase generation, and long document understanding. Our evaluation is based on six benchmark datasets, and we adopt the prompt suggested by OpenAI while extending it to six candidate prompts. We find that ChatGPT performs exceptionally well on all six candidate prompts, with minor performance differences observed across the datasets. Based on our findings, we conclude that ChatGPT has great potential for keyphrase generation. Moreover, we discover that ChatGPT still faces challenges when it comes to generating absent keyphrases. Finally, we present some limitations of this report and directions for future expansion.
https://arxiv.org/abs/2303.13001
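An evaluation of this kind might be driven by code like the following, using the chat API interface current at the time of the paper. The prompt wordings and parameters are illustrative, not the paper's exact candidates.

```python
import openai  # pre-1.0 SDK interface; assumes OPENAI_API_KEY is set

# Illustrative candidate prompts, not the six evaluated in the paper.
CANDIDATE_PROMPTS = [
    "Extract keyphrases from the following document:",
    "List the phrases that best summarize the following text:",
]

def generate_keyphrases(document, prompt):
    """Query the model with one candidate prompt and return its answer."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{prompt}\n\n{document}"}],
        temperature=0,  # deterministic output for comparable evaluation
    )
    return response["choices"][0]["message"]["content"]
```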
The potential of large language models (LLMs) to reason like humans has been a highly contested topic in Machine Learning communities. However, the reasoning abilities of humans are multifaceted and can be seen in various forms, including analogical, spatial and moral reasoning, among others. This fact raises the question of whether LLMs can perform equally well across all these different domains. This research work aims to investigate the performance of LLMs on different reasoning tasks by conducting experiments that directly use or draw inspiration from existing datasets on analogical and spatial reasoning. Additionally, to evaluate the ability of LLMs to reason like humans, their performance is evaluated on more open-ended, natural language questions. My findings indicate that LLMs excel at analogical and moral reasoning, yet struggle to perform as proficiently on spatial reasoning tasks. I believe these experiments are crucial for informing the future development of LLMs, particularly in contexts that require diverse reasoning proficiencies. By shedding light on the reasoning abilities of LLMs, this study aims to push forward our understanding of how they can better emulate the cognitive abilities of humans.
https://arxiv.org/abs/2303.12810
This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification and sentiment analysis of product reviews. A "cross-context" setting is enabled using test sets that are distinct from the training data. Specifically, in the news classification task, the models are developed on local news from India and tested on local news from China. In the sentiment analysis task, the models are trained on movie reviews and tested on customer reviews. This comparison is aimed at exploring the limits of the representative power of today's Natural Language Processing systems on the path to systems that generalize to real-life scenarios. The models are fine-tuned and fed into a Feed-Forward Neural Network and a Bidirectional Long Short-Term Memory network. Multinomial Naive Bayes and Linear Support Vector Machine are used as traditional baselines. The results show that, in binary text classification, DistilBERT generalizes significantly better than ELMo to the cross-context setting. ELMo is observed to be significantly more robust to the cross-context test data than both baselines. On the other hand, the baselines performed comparably well to ELMo when the training and test data are subsets of the same corpus (no cross-context). DistilBERT is also found to be 30% smaller and 83% faster than ELMo. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also favorable for incorporation into real-life systems since it requires a smaller computational training budget. When generalization is not the utmost preference and the test domain is similar to the training domain, traditional ML algorithms can still be considered more economical alternatives to deep language representations.
https://arxiv.org/abs/2303.12936
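The cross-context protocol is straightforward to reproduce for the traditional baselines. A minimal sketch with a tf-idf Multinomial Naive Bayes baseline, assuming the two corpora are already loaded as lists of texts and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

def cross_context_eval(train_texts, train_labels, test_texts, test_labels):
    """Train on one context (e.g., India news) and test on another
    (e.g., China news) to measure out-of-context generalization."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)  # reuse training vocabulary
    clf = MultinomialNB().fit(X_train, train_labels)
    return f1_score(test_labels, clf.predict(X_test))
```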
Electronic medical records (EMRs) are stored in relational databases. It can be challenging to access the required information if the user is unfamiliar with the database schema or general database fundamentals. Hence, researchers have explored text-to-SQL generation methods that give healthcare professionals direct access to EMR data without needing a database expert. However, currently available datasets have been essentially "solved", with state-of-the-art models achieving accuracy greater than or near 90%. In this paper, we show that there is still a long way to go before solving text-to-SQL generation in the medical domain. To show this, we create new splits of the existing medical text-to-SQL dataset MIMICSQL that better measure the generalizability of the resulting models. We evaluate state-of-the-art language models on our new splits, showing substantial drops in performance, with accuracy falling from up to 92% to 28%, thus leaving substantial room for improvement. Moreover, we introduce a novel data augmentation approach to improve the generalizability of the language models. Overall, this paper is the first step towards developing more robust text-to-SQL models in the medical domain.\footnote{The dataset and code will be released upon acceptance.}
https://arxiv.org/abs/2303.12898
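One plausible way to build such generalizability-oriented splits is to group examples by SQL template and hold out whole templates, so test-time query shapes are unseen during training. The criterion below is our assumption, not necessarily the authors' exact split.

```python
import re
from collections import defaultdict

def sql_template(sql):
    """Anonymize literals so examples sharing a query shape group together."""
    return re.sub(r'"[^"]*"|\b\d+(\.\d+)?\b', "<VAL>", sql.lower())

def template_split(examples, test_fraction=0.2):
    """examples: list of {"question": ..., "sql": ...} records."""
    groups = defaultdict(list)
    for ex in examples:
        groups[sql_template(ex["sql"])].append(ex)
    templates = sorted(groups)  # deterministic ordering
    n_test = max(1, int(len(templates) * test_fraction))
    test_templates = set(templates[:n_test])
    train = [ex for t in templates if t not in test_templates
             for ex in groups[t]]
    test = [ex for t in test_templates for ex in groups[t]]
    return train, test
```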
In recent years, Transformer-based models such as the Switch Transformer have achieved remarkable results in natural language processing tasks. However, these models are often too complex and require extensive pre-training, which limits their effectiveness for small clinical text classification tasks with limited data. In this study, we propose a simplified Switch Transformer framework and train it from scratch on a small French clinical text classification dataset at CHU Sainte-Justine hospital. Our results demonstrate that the simplified small-scale Transformer models outperform pre-trained BERT-based models, including DistilBERT, CamemBERT, FlauBERT, and FrALBERT. Additionally, the mixture-of-experts mechanism from the Switch Transformer helps capture diverse patterns; hence, the proposed approach achieves better results than a conventional Transformer with the self-attention mechanism. Finally, our proposed framework achieves an accuracy of 87\%, precision of 87\%, and recall of 85\%, compared to the third-best pre-trained BERT-based model, FlauBERT, which achieved an accuracy of 84\%, precision of 84\%, and recall of 84\%. However, Switch Transformers have limitations, including a generalization gap and sharp minima. We compare them with a multi-layer perceptron neural network for the classification of small French clinical narratives and show that the latter outperforms all other models.
https://arxiv.org/abs/2303.12892
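The core routing idea is compact. Below is a minimal top-1 (Switch-style) mixture-of-experts feed-forward layer in PyTorch; the dimensions and expert count are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Top-1 mixture-of-experts feed-forward layer: a router sends each
    token to exactly one expert, letting experts specialize on patterns."""

    def __init__(self, d_model=256, d_ff=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        weight, expert_idx = gates.max(dim=-1)  # route each token to 1 expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out
```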
Pretrained transformer-based models have shown high performance in natural language generation tasks. However, a new wave of interest has surged: automatic programming language generation. This task consists of translating natural language instructions into programming code. Despite the fact that well-known pretrained models on language generation have achieved good performance in learning programming languages, effort is still needed in automatic code generation. In this paper, we introduce JaCoText, a model based on the Transformer neural network. It aims to generate Java source code from natural language text. JaCoText leverages the advantages of both natural language and code generation models. More specifically, we study findings from the state of the art and use them to (1) initialize our model from powerful pretrained models, (2) explore additional pretraining on our Java dataset, (3) carry out experiments combining unimodal and bimodal data in training, and (4) scale the input and output length during the fine-tuning of the model. Conducted experiments on the CONCODE dataset show that JaCoText achieves new state-of-the-art results.
https://arxiv.org/abs/2303.12869
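Point (4) above, scaling input and output lengths, might be staged as in the sketch below. The checkpoint name and length schedule are placeholders, not the paper's settings.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint; JaCoText initializes from a strong pretrained
# seq2seq model before further pretraining on Java.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def encode(nl_instruction, java_code, max_in, max_out):
    """Tokenize one NL -> Java pair at the current length budget."""
    inputs = tokenizer(nl_instruction, truncation=True, max_length=max_in,
                       return_tensors="pt")
    labels = tokenizer(java_code, truncation=True, max_length=max_out,
                       return_tensors="pt").input_ids
    return inputs, labels

# Grow the length budget across fine-tuning phases (values illustrative):
# short sequences first for cheap steps, longer ones later for coverage.
length_schedule = [(128, 128), (256, 256), (512, 512)]
```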
While the state-of-the-art for frame semantic parsing has progressed dramatically in recent years, it is still difficult for end-users to apply state-of-the-art models in practice. To address this, we present Frame Semantic Transformer, an open-source Python library which achieves near state-of-the-art performance on FrameNet 1.7, while focusing on ease-of-use. We use a T5 model fine-tuned on Propbank and FrameNet exemplars as a base, and improve performance by using FrameNet lexical units to provide hints to T5 at inference time. We enhance robustness to real-world data by using textual data augmentations during training.
https://arxiv.org/abs/2303.12788
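Basic usage, following the interface documented in the project's README (verify against the version you install):

```python
# pip install frame-semantic-transformer
from frame_semantic_transformer import FrameSemanticTransformer

# Downloads the fine-tuned T5 model on first use.
frame_transformer = FrameSemanticTransformer()
result = frame_transformer.detect_frames(
    "The hallway smelt of boiled cabbage and old rag mats."
)
for frame in result.frames:
    # Each detected frame carries its evoked frame elements.
    print(frame.name, [(fe.name, fe.text) for fe in frame.frame_elements])
```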
A positive phrase or sentence with an underlying negative motive is usually defined as sarcasm, which is widely used on today's social media platforms such as Facebook, Twitter, and Reddit. In recent times, active users on social media platforms have been increasing dramatically, which raises the need for an automated NLP-based system that can be utilized in various tasks such as determining market demand, sentiment analysis, threat detection, etc. However, since sarcasm usually implies the opposite meaning and its detection is frequently a challenging issue, extracting meaning from data through an NLP-based model becomes more complicated. As a result, there has been a lot of study on sarcasm detection in English over the past several years, with noticeable improvement, yet the state of sarcasm detection in the Bangla language remains unchanged. In this article, we present a BERT-based system that achieves 99.60\%, while the traditional machine learning algorithms we utilized are only capable of achieving 89.93\%. Additionally, we have employed Local Interpretable Model-Agnostic Explanations (LIME) to introduce explainability to our system. Moreover, we have utilized a newly collected Bangla sarcasm dataset, BanglaSarc, that was constructed specifically for the evaluation of this study. This dataset consists of fresh records of sarcastic and non-sarcastic comments, the majority of which were acquired from Facebook and YouTube comment sections.
https://arxiv.org/abs/2303.12772
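The LIME step can be sketched as follows, with a stand-in classifier in place of the fine-tuned BERT model; the example comment and probabilities are fabricated for illustration only.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Stand-in for the fine-tuned BERT classifier: returns an
    (n_samples, 2) array of [non-sarcastic, sarcastic] probabilities."""
    return np.array(
        [[0.3, 0.7] if "great" in t else [0.8, 0.2] for t in texts]
    )

explainer = LimeTextExplainer(class_names=["non-sarcastic", "sarcastic"])
explanation = explainer.explain_instance(
    "what a great day to lose my wallet",  # illustrative comment
    predict_proba,
    num_features=6,
)
print(explanation.as_list())  # tokens most responsible for the prediction
```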
ChatGPT, the first large language model (LLM) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study of the task of stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.
https://arxiv.org/abs/2303.12767
Considering a conversation thread, stance classification aims to identify the opinion (e.g. agree or disagree) of replies towards a given target. The target of the stance is expected to be an essential component in this task, being one of the main factors that make it different from sentiment analysis. However, a recent study shows that a target-oblivious model outperforms target-aware models, suggesting that targets are not useful when predicting stance. This paper re-examines this phenomenon for rumour stance classification (RSC) on social media, where a target is a rumour story implied by the source tweet in the conversation. We propose adversarial attacks on the test data, aiming to assess the models' robustness and evaluate the role of the data in the models' performance. Results show that state-of-the-art models, including approaches that use the entire conversation thread, overly rely on superficial signals. Our hypothesis is that the naturally high occurrence of target-independent direct replies in RSC (e.g. "this is fake" or just "fake") results in the impressive performance of target-oblivious models, highlighting the risk of target instances being treated as noise during training.
https://arxiv.org/abs/2303.12665
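One simple attack in this spirit is a target swap: pair each reply with a different rumour's source tweet and check whether predictions change. The sketch below is our illustration of the idea, not the paper's exact attack set.

```python
import random

def target_swap_attack(examples, seed=0):
    """Pair each reply with the source tweet of a *different* rumour.
    A target-aware model should change its prediction; a model relying on
    superficial reply cues (e.g., "this is fake") will not."""
    rng = random.Random(seed)
    sources = [ex["source_tweet"] for ex in examples]
    shuffled = sources[:]
    rng.shuffle(shuffled)
    attacked = []
    for ex, new_source in zip(examples, shuffled):
        if new_source != ex["source_tweet"]:  # keep only genuine swaps
            attacked.append({**ex, "source_tweet": new_source})
    return attacked
```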
This paper describes our submission to ICASSP 2023 MUG Challenge Track 4, Keyphrase Extraction, which aims to extract the keyphrases most relevant to the conference theme from conference materials. We model the challenge as a single-class Named Entity Recognition task and develop techniques for better performance on the challenge: for data preprocessing, we encode the split keyphrases after word segmentation. In addition, we increase the amount of input information the model can accept at one time by fusing multiple preprocessed sentences into one segment. We replace the loss function with the multi-class focal loss to address the sparseness of keyphrases. Besides, we score each appearance of keyphrases and add an extra output layer to fit the scores, which are used to rank the keyphrases. Exhaustive evaluations are performed to find the best combination of the word segmentation tool, the pre-trained embedding model, and the corresponding hyperparameters. With these proposals, we scored 45.04 on the final test set.
https://arxiv.org/abs/2303.13463
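For reference, a standard multi-class focal loss (without class weighting) is shown below; the paper's exact variant and gamma value are not specified here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: down-weights easy (confident) tokens so the
    sparse keyphrase tags contribute more to the gradient.

    logits: (n_tokens, n_classes); targets: (n_tokens,) class indices.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                     # probability of the true class
    return (-((1 - pt) ** gamma) * log_pt).mean()
```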
The advancement of speech technologies has been remarkable, yet their integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digits for African languages, currently covering 38 African languages. As a demonstration of the practical applications of AfroDigits, we conduct audio digit classification experiments on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo (kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R models. Our experiments reveal a useful insight into the effect of mixing African speech corpora during finetuning. AfroDigits is the first published audio digit dataset for African languages, and we believe it will, among other things, pave the way for Afro-centric speech applications such as the recognition of telephone numbers and street numbers. We release the dataset and platform publicly at this https URL and this https URL respectively.
https://arxiv.org/abs/2303.12582
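A hedged sketch of the classification setup with Hugging Face Transformers, using one public XLS-R checkpoint; the paper's exact models, data pipeline, and fine-tuning recipe may differ.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# Public checkpoint chosen for illustration; swap in the paper's models.
checkpoint = "facebook/wav2vec2-xls-r-300m"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    checkpoint, num_labels=10  # spoken digits 0-9
)

waveform = torch.zeros(16000)  # stand-in for one second of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(-1))  # predicted digit class (head untrained here)
```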