This paper studies safe reinforcement learning (RL) in partially observable environments with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDPs), ensuring safety typically involves estimating the belief over latent states. However, accurately estimating an optimal Bayesian filter in a POMDP to infer latent states from observations in a continuous state space poses a significant challenge, largely due to the intractable likelihood. To tackle this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely under unknown system dynamics and partial observability. We leverage Predictive State Representations (PSRs) and a Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, with provable guarantees in this setting. Furthermore, we derive the essential operators from the kernel Bayes' rule, enabling recursive estimation of future observations. Under the assumption of \textit{undercompleteness}, a polynomial sample complexity is established for the RL algorithm even with infinite observation and action spaces, guaranteeing an $\epsilon$-suboptimal safe policy.
https://arxiv.org/abs/2312.00727
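To make the kernel Bayes' rule step in the abstract above concrete, the following is a generic, hedged illustration of the kind of operator it refers to: a regularised empirical conditional mean embedding mapping an observation $y$ to the RKHS embedding of the conditional distribution over the latent/predictive state. The notation is standard RKHS notation, not taken from the paper.
\begin{align}
  \hat{\mu}_{X \mid y} &= \hat{C}_{XY}\bigl(\hat{C}_{YY} + \lambda I\bigr)^{-1} k_Y(\cdot, y), \\
  \hat{C}_{XY} &= \frac{1}{n}\sum_{i=1}^{n} \phi_X(x_i) \otimes \phi_Y(y_i), \qquad
  \hat{C}_{YY} = \frac{1}{n}\sum_{i=1}^{n} \phi_Y(y_i) \otimes \phi_Y(y_i),
\end{align}
where $\phi_X, \phi_Y$ are the feature maps of the chosen kernels and $\lambda$ is a regularisation constant; a PSR-style recursion applies such an operator repeatedly as each new action-observation pair arrives.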
Guided by grammatical structure, words compose to form sentences, and guided by discourse structure, sentences compose to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them via a translation of grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. The simulations executed on IBMQ software converged with an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and approached the state-of-the-art SpanBERT. A mixed quantum-classical model further improved these results, increasing the F1 score by around 6%.
https://arxiv.org/abs/2312.00688
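As a rough, library-level illustration of training a VQC for binary classification (not the paper's actual circuits, which are compiled from grammatical and discourse structure), a minimal PennyLane sketch with an assumed angle encoding, a generic hardware-efficient ansatz, and toy data might look like this:

```python
# Minimal VQC sketch with PennyLane; circuit shape, encoding, and data are assumptions.
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(params, features):
    for i in range(n_qubits):                    # angle-encode the input features
        qml.RY(features[i], wires=i)
    for layer in params:                         # parametrised ansatz: rotations + entanglers
        for i in range(n_qubits):
            qml.RY(layer[i], wires=i)
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])
    return qml.expval(qml.PauliZ(0))             # expectation value in [-1, 1]

def loss(params, X, y):
    se = 0.0
    for x, target in zip(X, y):
        prob = (circuit(params, x) + 1) / 2      # map expectation to a probability-like score
        se = se + (prob - target) ** 2
    return se / len(X)

params = np.array(0.01 * np.random.randn(2, n_qubits), requires_grad=True)
X_train = np.array(np.random.uniform(0, np.pi, (8, n_qubits)), requires_grad=False)
y_train = np.array(np.random.randint(0, 2, 8), requires_grad=False)

opt = qml.GradientDescentOptimizer(stepsize=0.1)
for _ in range(20):
    params = opt.step(lambda p: loss(p, X_train, y_train), params)
```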
This paper presents our work on the Violence Inciting Text Detection shared task of the First Workshop on Bangla Language Processing. Social media has accelerated the propagation of hate and violence-inciting speech in society. It is essential to develop efficient mechanisms to detect and curb the propagation of such texts. The problem of detecting violence-inciting texts is further exacerbated in low-resource settings due to sparse research and limited data. The data provided in the shared task consists of texts in the Bangla language, where each example is classified into one of three categories defined by the type of violence-inciting text. We evaluate several BERT-based models and use an ensemble of them as our final submission. Our submission is ranked 10th on the final leaderboard of the shared task with a macro F1 score of 0.737.
https://arxiv.org/abs/2311.18778
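A minimal sketch of the soft-voting ensemble idea from the submission above (averaging class probabilities from several fine-tuned BERT-style checkpoints); the model names, input text, and 3-class head below are placeholders rather than the actual checkpoints:

```python
# Soft-voting ensemble sketch; checkpoint names and the 3-class head are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_names = ["bert-base-multilingual-cased", "xlm-roberta-base"]  # hypothetical fine-tuned models
text = "..."  # a Bangla sentence to classify

probs = []
for name in model_names:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs.append(torch.softmax(model(**inputs).logits, dim=-1))

ensemble_probs = torch.stack(probs).mean(dim=0)    # average the class probabilities
prediction = ensemble_probs.argmax(dim=-1).item()  # index of the predicted violence category
```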
Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce both the error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and the error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
https://arxiv.org/abs/2311.18364
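The hubness measurement and one common reduction step can be sketched as follows; the k-occurrence skewness as a hubness score and local scaling as a secondary-distance method are standard choices from the hubness literature and may differ from the exact combination the paper above identifies:

```python
# Sketch: measure hubness (skewness of the k-occurrence distribution) of sentence embeddings
# and reduce it with a simple local-scaling secondary distance. Embeddings here are random stand-ins.
import numpy as np
from scipy.stats import skew
from sklearn.metrics.pairwise import cosine_distances

def k_occurrence(dist, k=10):
    # N_k(x): how often each point appears in another point's k-nearest-neighbour list.
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]        # skip self at position 0
    return np.bincount(nn.ravel(), minlength=dist.shape[0])

def local_scaling(dist, k=10):
    # Rescale distances by each point's distance to its k-th neighbour (a common reduction method).
    sigma = np.sort(dist, axis=1)[:, k]
    return 1.0 - np.exp(-dist ** 2 / (sigma[:, None] * sigma[None, :]))

emb = np.random.randn(1000, 384)                     # stand-in for Sentence-BERT embeddings
d = cosine_distances(emb)
print("hubness before:", skew(k_occurrence(d)))
print("hubness after :", skew(k_occurrence(local_scaling(d))))
```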
Turkish is one of the most widely spoken languages in the world. The wide use of this language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk with significantly lower inference time. We trained our model using the same approach as the RoBERTa model and evaluated it on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives in generalizability, and its lower inference time gives it a significant advantage for processing large-scale datasets. We also compared our models with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we released TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: this https URL
https://arxiv.org/abs/2311.18063
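Assuming the released checkpoint above is published on the Hugging Face Hub, loading it for simple feature extraction could look like the sketch below; the model id is a placeholder to be replaced with the one behind the paper's link:

```python
# Hypothetical usage sketch; the model id is a placeholder for the checkpoint behind the paper's URL.
import torch
from transformers import AutoModel, AutoTokenizer

name = "VRLLab/TurkishBERTweet"  # placeholder id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

tweet = "bugün hava çok güzel"
inputs = tok(tweet, return_tensors="pt", truncation=True)
with torch.no_grad():
    sentence_vector = model(**inputs).last_hidden_state.mean(dim=1)  # simple mean-pooled embedding
```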
Cross-lingual transfer learning is an important property of multilingual large language models (LLMs). But how do LLMs represent relationships between languages? Every language model has an input layer that maps tokens to vectors. This ubiquitous layer of language models is often overlooked. We find that similarities between these input embeddings are highly interpretable and that the geometry of these embeddings differs between model families. In one case (XLM-RoBERTa), the embeddings encode language: tokens in different writing systems can be linearly separated with an average of 99.2% accuracy. Another family (mT5) represents cross-lingual semantic similarity: the 50 nearest neighbors for any token represent an average of 7.61 writing systems and are frequently translations. This result is surprising given that there are no explicit parallel cross-lingual training corpora and no explicit incentive for translations in the pre-training objectives. Our research opens the door to investigations of 1) the effect of pre-training and model architectures on representations of languages and 2) the applications of cross-lingual representations embedded in language models.
https://arxiv.org/abs/2311.18034
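A rough way to reproduce the "writing systems are linearly separable in the input embeddings" finding above: pull the input embedding matrix, label tokens by Unicode script with a crude heuristic, and fit a linear classifier. The labelling and subsampling below are simplifications, not the paper's protocol:

```python
# Probe sketch: do input embeddings linearly encode writing system? (heuristic labels, subsampled)
import unicodedata
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
emb = AutoModel.from_pretrained(name).get_input_embeddings().weight.detach().numpy()

def script_of(token):
    # Label a token by the Unicode block of its first alphabetic character (crude heuristic).
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]  # e.g. LATIN, CYRILLIC, CJK
    return None

X, y = [], []
for token, idx in list(tok.get_vocab().items())[:20000]:       # subsample for speed
    label = script_of(token)
    if label:
        X.append(emb[idx])
        y.append(label)

X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), np.array(y), test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # linear separability of scripts in the input embedding space
```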
Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by the SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both observed noisy signals and enhanced signals. During training, the SE frontend is randomly selected from a pool of models. We evaluate the performance of FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the CHiME-4 1-channel dataset. The experimental results demonstrate a significant relative reduction in word error rate (WER).
https://arxiv.org/abs/2311.17790
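The layer-wise fusion idea above can be pictured as a small PyTorch module; the gated element-wise mixing below is one plausible reading of "incorporate features from both noisy and enhanced signals", with the number of layers and feature dimension chosen arbitrarily:

```python
# Sketch of a layer-wise fusion module (assumption: a learned gate per SSL layer mixes features
# extracted from the noisy signal and from the SE-enhanced signal).
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.gates = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_layers)])

    def forward(self, noisy_feats, enhanced_feats):
        # Both inputs: lists of (batch, time, dim) tensors, one per encoder layer.
        fused = []
        for gate, n, e in zip(self.gates, noisy_feats, enhanced_feats):
            g = torch.sigmoid(gate(torch.cat([n, e], dim=-1)))  # element-wise mixing weight
            fused.append(g * n + (1 - g) * e)
        return fused

fusion = LayerwiseFusion(num_layers=12, dim=768)
noisy = [torch.randn(2, 100, 768) for _ in range(12)]
enhanced = [torch.randn(2, 100, 768) for _ in range(12)]
out = fusion(noisy, enhanced)
```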
Learning effective sentence representations is crucial for many Natural Language Processing (NLP) tasks, including semantic search, semantic textual similarity (STS), and clustering. While multiple transformer models have been developed for sentence embedding learning, these models may not perform optimally when dealing with specialized domains like aviation, which has unique characteristics such as technical jargon, abbreviations, and unconventional grammar. Furthermore, the absence of labeled datasets makes it difficult to train models specifically for the aviation domain. To address these challenges, we propose a novel approach for adapting sentence transformers for the aviation domain. Our method is a two-stage process consisting of pre-training followed by fine-tuning. During pre-training, we use Transformers and Sequential Denoising AutoEncoder (TSDAE) with aviation text data as input to improve the initial model performance. Subsequently, we fine-tune our models using a Natural Language Inference (NLI) dataset in the Sentence Bidirectional Encoder Representations from Transformers (SBERT) architecture to mitigate overfitting issues. Experimental results on several downstream tasks show that our adapted sentence transformers significantly outperform general-purpose transformers, demonstrating the effectiveness of our approach in capturing the nuances of the aviation domain. Overall, our work highlights the importance of domain-specific adaptation in developing high-quality NLP solutions for specialized industries like aviation.
https://arxiv.org/abs/2305.09556
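A minimal sketch of the two-stage recipe above using the sentence-transformers library (TSDAE pre-training on unlabelled aviation text, then contrastive NLI-style fine-tuning); the corpus contents, base model, and hyperparameters are placeholders:

```python
# Two-stage adaptation sketch: TSDAE pre-training, then NLI fine-tuning with an SBERT-style loss.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, datasets, InputExample

word_emb = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# Stage 1: TSDAE on raw aviation sentences (noisy/reconstruction pairs are built automatically).
aviation_sentences = ["NOTAM issued for runway 27L closure", "ATC cleared the aircraft to FL350"]
tsdae_data = datasets.DenoisingAutoEncoderDataset(aviation_sentences)
tsdae_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True)
model.fit(train_objectives=[(DataLoader(tsdae_data, batch_size=2, shuffle=True), tsdae_loss)],
          epochs=1, weight_decay=0, scheduler="constantlr")

# Stage 2: supervised fine-tuning on NLI pairs with a contrastive SBERT objective.
nli_examples = [
    InputExample(texts=["the flight was delayed", "the departure was late"]),
    InputExample(texts=["visibility was poor", "fog reduced visibility"]),
]
nli_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(DataLoader(nli_examples, batch_size=2, shuffle=True), nli_loss)],
          epochs=1)
```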
Community Question Answering (CQA) has become increasingly prevalent in recent years. However, the large number of answers makes it difficult for users to select the relevant ones. Therefore, answer selection is a very significant subtask of CQA. In this paper, we first propose the Question-Answer cross attention networks (QAN) with pre-trained models for answer selection, and we utilize a large language model (LLM) to perform answer selection with knowledge augmentation. Specifically, we apply the BERT model as the encoder layer to pre-train on question subjects, question bodies and answers, respectively; the cross attention mechanism then selects the most relevant answer for each question. Experiments show that the QAN model achieves state-of-the-art performance on two datasets, SemEval2015 and SemEval2017. Moreover, we use the LLM to generate external knowledge from questions and correct answers to achieve knowledge augmentation for the answer selection task, while optimizing the prompt of the LLM in different aspects. The results show that the introduction of external knowledge can improve the correct answer selection rate of the LLM on the SemEval2015 and SemEval2017 datasets. Meanwhile, the LLM can also select the correct answer on more questions with an optimized prompt.
https://arxiv.org/abs/2311.17502
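One plausible reading of the question-answer cross attention layer above, sketched in PyTorch on top of a BERT encoder; the pooling and scoring head are assumptions, not the paper's exact design:

```python
# Sketch of question-answer cross-attention for answer ranking (interpretation, not the QAN spec).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QACrossAttention(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        dim = self.encoder.config.hidden_size
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, question_inputs, answer_inputs):
        q = self.encoder(**question_inputs).last_hidden_state   # question subject + body tokens
        a = self.encoder(**answer_inputs).last_hidden_state     # candidate answer tokens
        attended, _ = self.cross_attn(query=q, key=a, value=a)  # question attends over the answer
        return self.score(attended.mean(dim=1))                 # relevance score for ranking

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = QACrossAttention()
score = model(tok("How do I renew my visa?", return_tensors="pt"),
              tok("You need to file a new application at the embassy.", return_tensors="pt"))
```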
Prompt-based learning has been widely applied in many low-resource NLP tasks such as few-shot scenarios. However, this paradigm has been shown to be vulnerable to backdoor attacks. Most existing attack methods focus on inserting manually predefined templates as triggers in the pre-training phase to train the victim model and utilize the same triggers in the downstream task to perform inference, which tends to ignore the transferability and stealthiness of the templates. In this work, we propose TARGET (Template-trAnsfeRable backdoor attack aGainst prompt-basEd NLP models via GPT4), a data-independent attack method. Specifically, we first utilize GPT4 to reformulate manual templates to generate tone-strong and normal templates, and the former are injected into the model as a backdoor trigger in the pre-training phase. Then, we not only directly employ the above templates in the downstream task, but also use GPT4 to generate templates with a similar tone to the above templates to carry out transferable attacks. Finally, we conducted extensive experiments on five NLP datasets and three BERT-series models, with experimental results showing that our TARGET method has better attack performance and stealthiness than the two external baseline methods on direct attacks, and in addition achieves satisfactory attack capability on unseen tone-similar templates.
https://arxiv.org/abs/2311.17429
Transformer-based models, such as BERT and GPT, have been widely adopted in natural language processing (NLP) due to their exceptional performance. However, recent studies show their vulnerability to textual adversarial attacks, where the model's output can be misled by intentionally manipulating the text inputs. Despite various methods that have been proposed to enhance the model's robustness and mitigate this vulnerability, many require heavy resource consumption (e.g., adversarial training) or only provide limited protection (e.g., defensive dropout). In this paper, we propose a novel method called dynamic attention, tailored for the transformer architecture, to enhance the inherent robustness of the model itself against various adversarial attacks. Our method requires no downstream task knowledge and does not incur additional costs. The proposed dynamic attention consists of two modules: (i) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens. Extensive experiments demonstrate that dynamic attention significantly mitigates the impact of adversarial attacks, achieving up to 33\% better performance than previous methods against widely-used adversarial attacks. The model-level design of dynamic attention enables it to be easily combined with other defense methods (e.g., adversarial training) to further enhance the model's robustness. Furthermore, we demonstrate that dynamic attention preserves the state-of-the-art robustness space of the original model compared to other dynamic modeling methods.
https://arxiv.org/abs/2311.17400
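The attention-rectification module above can be illustrated with a small, self-contained function that masks or weakens the pre-softmax attention scores of a chosen candidate set; how candidates are chosen (the dynamic-modeling module) is only stubbed here:

```python
# Sketch of attention rectification: suppress or soften attention to dynamically chosen tokens.
import torch

def rectify_attention(scores, candidate_mask, weaken=0.0):
    # scores: (batch, heads, query_len, key_len) pre-softmax attention scores
    # candidate_mask: (batch, key_len) bool, True for tokens to suppress
    if weaken > 0:  # soften instead of fully masking
        penalty = torch.where(candidate_mask, torch.log(torch.tensor(weaken)), torch.tensor(0.0))
    else:           # hard mask
        penalty = torch.where(candidate_mask, torch.tensor(float("-inf")), torch.tensor(0.0))
    return torch.softmax(scores + penalty[:, None, None, :], dim=-1)

scores = torch.randn(2, 12, 16, 16)
candidate_mask = torch.zeros(2, 16, dtype=torch.bool)
candidate_mask[:, 5] = True        # e.g. tokens flagged by the dynamic-modeling step
attn = rectify_attention(scores, candidate_mask, weaken=0.1)
```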
In the past decade, using Street View images and machine learning to measure human perception has become a mainstream research approach in urban science. However, this approach, which uses only shallow image information, makes it difficult to comprehensively understand the deep semantic features of human perception of a scene. In this study, we propose a new framework based on a pre-trained natural language model to understand the relationship between human perception and the sense of a scene. Firstly, Place Pulse 2.0 was used as our base dataset, which contains a variety of human-perceived labels, namely, beautiful, safe, wealthy, depressing, boring, and lively. An image captioning network was used to extract the description of each street view image. Secondly, a pre-trained BERT model was fine-tuned with a regression head for the six human perceptual dimensions. Furthermore, we compared the performance of five traditional regression methods with our approach and conducted a migration experiment in Hong Kong. Our results show that human perception scoring with deep semantic features performed better than in previous studies using machine learning methods with shallow features. The use of deep scene semantic features provides new ideas for subsequent human perception research, as well as better explanatory power in the face of spatial heterogeneity.
https://arxiv.org/abs/2311.17354
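The fine-tuning step above can be pictured as a BERT encoder with a six-output regression head applied to the generated caption of each street view image; the pooling choice and the example caption are illustrative assumptions:

```python
# Sketch: a BERT encoder with a 6-way regression head for the perceptual dimensions
# (beautiful, safe, wealthy, depressing, boring, lively), fed with generated image captions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PerceptionRegressor(nn.Module):
    def __init__(self, name="bert-base-uncased", num_dims=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_dims)

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return self.head(cls)                                  # one score per perceptual dimension

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PerceptionRegressor()
caption = "a narrow street with old buildings and parked cars"  # output of the captioning network
scores = model(**tok(caption, return_tensors="pt"))
```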
Traditional Chinese medicine (TCM) prescription is the most critical form of TCM treatment, and uncovering the complex nonlinear relationship between symptoms and TCM prescriptions is of great significance for clinical practice and for assisting physicians in diagnosis and treatment. Although there have been some studies on TCM prescription generation, these studies consider a single factor and directly model the symptom-to-prescription generation problem mainly based on symptom descriptions, lacking guidance from TCM knowledge. To this end, we propose RoKEPG, a RoBERTa and Knowledge Enhancement model for Prescription Generation of Traditional Chinese Medicine. RoKEPG is first pre-trained on our constructed TCM corpus and then fine-tuned, and the model is guided to generate TCM prescriptions by introducing four classes of TCM knowledge through the attention mask matrix. Experimental results on the publicly available TCM prescription dataset show that RoKEPG improves the F1 metric by about 2% over the best-performing baseline model.
https://arxiv.org/abs/2311.17307
Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and such pretraining is still better than using only clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
https://arxiv.org/abs/2311.17280
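A lightweight sketch of what a "Unigram + Object"-style generator could look like: sample filler words from a unigram distribution over the instruction corpus and splice in names of objects seen along the trajectory. The exact recipe below is an assumption based on the method's name, not the paper's implementation:

```python
# Hypothetical Unigram + Object instruction generator (details assumed from the method name).
import random
from collections import Counter

def unigram_object_instruction(corpus_tokens, trajectory_objects, length=12, n_objects=2):
    counts = Counter(corpus_tokens)
    vocab, weights = zip(*counts.items())
    words = random.choices(vocab, weights=weights, k=length)   # unigram-sampled filler words
    for _ in range(n_objects):                                 # splice in visible object names
        words.insert(random.randrange(len(words)), random.choice(trajectory_objects))
    return " ".join(words)

corpus = "walk past the table and stop near the door turn left at the stairs".split()
objects = ["sofa", "lamp", "doorway"]
print(unigram_object_instruction(corpus, objects))
```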
Artificial intelligence and machine learning have significantly bolstered the technological world. This paper explores the potential of transfer learning in natural language processing, focusing mainly on sentiment analysis. Models trained on big data can also be used where data are scarce. The claim is that, compared to training models from scratch, transfer learning using pre-trained BERT models can increase sentiment classification accuracy. The study adopts a sophisticated experimental design that uses the IMDb dataset of sentimentally labelled movie reviews. Pre-processing includes tokenization and encoding of the text data, making it suitable for NLP models. The dataset is used with a BERT-based model, and its performance is measured using accuracy. The reported result is 100 per cent accuracy. Although perfect accuracy may appear impressive, it might be the result of overfitting or a lack of generalization. Further analysis is required to ensure the model's ability to handle diverse and unseen data. The findings underscore the effectiveness of transfer learning in NLP, showcasing its potential to excel in sentiment analysis tasks. However, the research calls for a cautious interpretation of perfect accuracy and emphasizes the need for additional measures to validate the model's generalization.
https://arxiv.org/abs/2311.16965
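The pipeline described above (tokenize IMDb reviews, fine-tune a pre-trained BERT classifier, measure accuracy) follows the standard Hugging Face recipe; the sketch below uses a small subset and a single epoch purely to keep it cheap, and the held-out evaluation is the kind of check the abstract argues is needed before trusting a perfect score:

```python
# Sketch: fine-tune BERT on a subset of IMDb and evaluate on held-out data.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

ds = load_dataset("imdb")
def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)
train = ds["train"].shuffle(seed=0).select(range(2000)).map(encode, batched=True)
test = ds["test"].shuffle(seed=0).select(range(500)).map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="imdb-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train,
    eval_dataset=test,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # held-out accuracy, a basic guard against an overfit "perfect" score
```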
We propose a new automated evaluation metric for machine-generated radiology reports, using the successful COMET architecture adapted for the radiology domain. We train and publish four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph. Our results show that our metric correlates moderately to highly with established metrics such as BERTScore, BLEU, and CheXbert scores. Furthermore, we demonstrate that one of our checkpoints exhibits a high correlation with human judgment, as assessed using the publicly available annotations of six board-certified radiologists on a set of 200 reports. We also performed our own analysis, gathering annotations from two radiologists on a collection of 100 reports. The results indicate the potential effectiveness of our method as a radiology-specific evaluation metric. The code, data, and model checkpoints to reproduce our findings will be publicly available.
https://arxiv.org/abs/2311.16764
While the performance of many text classification tasks has recently improved due to Pre-trained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on \textit{political} topics often fails when tested on documents about \textit{sport} or \textit{medicine}. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Consequently, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50\% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at this https URL
https://arxiv.org/abs/2311.16083
Function is increasingly recognized as an important indicator of whole-person health, although it receives little attention in clinical natural language processing research. We introduce the first public annotated dataset specifically on the Mobility domain of the International Classification of Functioning, Disability and Health (ICF), aiming to facilitate automatic extraction and analysis of functioning information from free-text clinical notes. We utilize the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences using keyword expansion. Our active learning approach, using query-by-committee sampling weighted by density representativeness, selects informative sentences for human annotation. We train BERT and CRF models, and use predictions from these models to guide the selection of new sentences for subsequent annotation iterations. Our final dataset consists of 4,265 sentences with a total of 11,784 entities, including 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities. The inter-annotator agreement (IAA), averaged over all entity types, is 0.72 for exact matching and 0.91 for partial matching. We also train and evaluate common BERT models and state-of-the-art Nested NER models. The best F1 scores are 0.84 for Action, 0.7 for Mobility, 0.62 for Assistance, and 0.71 for Quantification. Empirical results demonstrate promising potential of NER models to accurately extract mobility functioning information from clinical text. The public availability of our annotated dataset will facilitate further research to comprehensively capture functioning information in electronic health records (EHRs).
https://arxiv.org/abs/2311.15946
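The sentence-selection step above, query-by-committee disagreement weighted by density representativeness, can be sketched with plain NumPy; vote entropy and mean cosine similarity are common concrete choices and may not match the paper's exact formulation:

```python
# Sketch: rank unlabelled sentences by committee disagreement weighted by density (details assumed).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def vote_entropy(committee_probs):
    # committee_probs: (n_models, n_samples, n_classes); average and take entropy per sample.
    mean = committee_probs.mean(axis=0)
    return -(mean * np.log(mean + 1e-12)).sum(axis=1)

def select_batch(committee_probs, embeddings, k=10):
    disagreement = vote_entropy(committee_probs)
    density = cosine_similarity(embeddings).mean(axis=1)  # how representative each sentence is
    scores = disagreement * density
    return np.argsort(-scores)[:k]                        # indices to send for human annotation

probs = np.random.dirichlet(np.ones(5), size=(3, 200))    # 3 committee members, 200 sentences
emb = np.random.randn(200, 64)                            # stand-in sentence embeddings
print(select_batch(probs, emb))
```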
The use of pre-training is an emerging technique to enhance a neural model's performance, and it has been shown to be effective for many neural language models such as BERT. This technique has also been used to enhance the performance of recommender systems. In such recommender systems, pre-training models are used to learn a better initialisation for both users and items. However, existing pre-trained recommender systems tend to only incorporate the user interaction data at the pre-training stage, making it difficult to deliver good recommendations, especially when the interaction data is sparse. To alleviate this common data sparsity issue, we propose to pre-train the recommendation model not only with the interaction data but also with other available information, such as the social relations among users, thereby providing the recommender system with a better initialisation compared with relying solely on the user interaction data. We propose a novel recommendation model, the Social-aware Gaussian Pre-trained model (SGP), which encodes the user social relations and interaction data at the pre-training stage in a Graph Neural Network (GNN). Afterwards, in the subsequent fine-tuning stage, our SGP model adopts a Gaussian Mixture Model (GMM) to factorise these pre-trained embeddings for further training, thereby benefiting cold-start users through these pre-built social relations. Our extensive experiments on three public datasets show that, in comparison to 16 competitive baselines, our SGP model significantly outperforms the best baseline by up to 7.7% in terms of NDCG@10. In addition, we show that SGP effectively alleviates the cold-start problem, especially when users newly register to the system through their friends' suggestions.
https://arxiv.org/abs/2311.15790
Educational crosswords offer numerous benefits for students, including increased engagement, improved understanding, critical thinking, and memory retention. Creating high-quality educational crosswords can be challenging, but recent advances in natural language processing and machine learning have made it possible to use language models to generate engaging wordplay. The exploitation of cutting-edge language models like GPT3-DaVinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT-uncased has led to the development of a comprehensive system for generating and verifying crossword clues. A large dataset of clue-answer pairs was compiled to fine-tune the models in a supervised manner to generate original and challenging clues from a given keyword. For generating crossword clues from a given text, on the other hand, zero/few-shot learning techniques were used to extract clues from the input text, adding variety and creativity to the puzzles. We employed the fine-tuned model to generate data and labeled the acceptability of clue-answer pairs with human supervision. To ensure quality, we developed a classifier by fine-tuning existing language models on the labeled dataset. To assess the quality of clues generated from a given text with zero/few-shot learning, we employed a zero-shot learning approach. The results of the evaluation have been very promising, demonstrating the effectiveness of the approach in creating high-standard educational crosswords that offer students engaging and rewarding learning experiences.
https://arxiv.org/abs/2311.15723