Assurance cases are used to argue for the safety of products in safety engineering, and in safety-critical areas their construction is indispensable. Trustworthiness Derivation Trees (TDTs) enhance assurance cases by incorporating formal methods, making automatic reasoning about assurance cases possible. We present the Trustworthiness Derivation Tree Analyzer (Trusta), a desktop application designed to automatically construct and verify TDTs. The tool has a built-in Prolog interpreter in its backend and is supported by the constraint solvers Z3 and MONA, so it can solve constraints over logical formulas involving arithmetic, sets, Horn clauses, etc. Trusta also utilizes large language models to make the creation and evaluation of assurance cases more convenient, while allowing interactive human examination and modification. We evaluated top language models such as ChatGPT-3.5, ChatGPT-4, and PaLM 2 for generating assurance cases. Our tests showed a 50%-80% similarity between machine-generated and human-created cases. In addition, Trusta can extract formal constraints from natural-language text, facilitating an easier interpretation and validation process. This extraction is subject to human review and correction, blending automated efficiency with human insight. To our knowledge, this marks the first integration of large language models into automatically creating and reasoning about assurance cases, bringing a novel approach to a traditional challenge. Through several industrial case studies, Trusta has proven able to quickly find subtle issues that are typically missed in manual inspection, demonstrating its practical value in enhancing the assurance case development process.
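The Prolog backend mentioned here evaluates Horn-clause constraints over TDT nodes. As a hedged illustration of that style of reasoning (the rule names and facts below are invented, not taken from the paper), a minimal forward-chaining evaluator over propositional Horn clauses might look like:

```python
# Minimal forward chaining over propositional Horn clauses.
# Each rule is (body, head): if every atom in `body` is derived, derive `head`.
# Facts are rules with an empty body.

def forward_chain(rules):
    """Return the set of all derivable atoms (a fixpoint computation)."""
    derived = set()
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and all(a in derived for a in body):
                derived.add(head)
                changed = True
    return derived

# Hypothetical TDT-style rules: a safety goal holds if its sub-goals hold.
rules = [
    ([], "sensor_verified"),             # fact
    ([], "controller_verified"),         # fact
    (["sensor_verified", "controller_verified"], "subsystem_safe"),
    (["subsystem_safe"], "system_safe"),
]

print(forward_chain(rules))
```

A real TDT would attach arithmetic or set constraints to nodes and discharge them with Z3 or MONA; this sketch only shows the logical-derivation skeleton.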
https://arxiv.org/abs/2309.12941
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With the advances seen recently with deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks have increased against voice-based technologies, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors to the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
https://arxiv.org/abs/2309.12914
Nowadays, increasingly more data are available as knowledge graphs (KGs). While this data model supports advanced reasoning and querying, KGs remain difficult to mine due to their size and complexity. Graph mining approaches can be used to extract patterns from KGs, but this presents two main issues. First, graph mining approaches tend to extract too many patterns for a human analyst to interpret (pattern explosion). Second, real-life KGs tend to differ from the graphs usually treated in graph mining: they are multigraphs, their vertex degrees tend to follow a power law, and the way in which they model knowledge can produce spurious patterns. Recently, a graph mining approach named GraphMDL+ has been proposed to tackle the problem of pattern explosion using the Minimum Description Length (MDL) principle. However, GraphMDL+, like other graph mining approaches, is not suited for KGs without adaptations. In this paper we propose KG-MDL, a graph pattern mining approach based on the MDL principle that, given a KG, generates a human-sized and descriptive set of graph patterns, and does so in a parameter-less and anytime way. We report on experiments on medium-sized KGs showing that our approach generates sets of patterns that are both small enough to be interpreted by humans and descriptive of the KG. We show that the extracted patterns highlight relevant characteristics of the data: both of the schema used to create the data and of the concrete facts it contains. We also discuss the issues related to mining graph patterns on knowledge graphs, as opposed to other types of graph data.
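The MDL principle selects the model minimizing L(M) + L(D|M): the bits needed to describe the model plus the bits needed to describe the data given the model. As a hedged, toy illustration (not KG-MDL's actual graph encoding), a two-part score over a sequence of observed symbols might be:

```python
import math
from collections import Counter

def code_length(data, model):
    """Two-part MDL score in bits: L(model) + L(data | model).

    `model` is a set of symbols given frequency-based (Shannon) codes;
    all other symbols fall back to a uniform code over the alphabet.
    """
    alphabet = set(data)
    counts = Counter(data)
    total = len(data)
    uniform_bits = math.log2(max(len(alphabet), 2))
    # L(model): naively, one uniform codeword per symbol kept in the model.
    l_model = len(model) * uniform_bits
    # L(data | model): Shannon code for modeled symbols, uniform for the rest.
    l_data = 0.0
    for sym, n in counts.items():
        if sym in model:
            l_data += n * -math.log2(n / total)
        else:
            l_data += n * uniform_bits
    return l_model + l_data

data = ["a", "a", "a", "a", "b", "c"]
# Modeling the frequent symbol "a" compresses better than using no model.
print(code_length(data, {"a"}) < code_length(data, set()))
```

KG-MDL applies the same minimize-total-description-length idea, but with graph patterns as the model and the KG itself as the data.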
https://arxiv.org/abs/2309.12908
Neural machine translation (NMT) has shown impressive performance when trained on large-scale corpora. However, generic NMT systems have demonstrated poor performance on out-of-domain translation. To mitigate this issue, several domain adaptation methods have recently been proposed, which often lead to better translation quality than generic NMT systems. While there has been some continuous progress in NMT for English and other European languages, domain adaptation in Arabic has received little attention in the literature. The current study, therefore, aims to explore the effectiveness of domain-specific adaptation for Arabic MT (AMT) in a yet unexplored domain: financial news articles. To this end, we carefully developed a parallel corpus for Arabic-English (AR-EN) translation in the financial domain for benchmarking different domain adaptation methods. We then fine-tuned several pre-trained NMT and large language models, including ChatGPT-3.5 Turbo, on our dataset. The results showed that fine-tuning is successful using just a few well-aligned in-domain AR-EN segments. The quality of ChatGPT translation was superior to that of the other models based on automatic and human evaluations. To the best of our knowledge, this is the first work on fine-tuning ChatGPT towards financial domain transfer learning. To contribute to research in domain translation, we made our datasets and fine-tuned models available at this https URL.
https://arxiv.org/abs/2309.12863
Open-set object recognition aims to identify whether an object is from a class that has been encountered during training. To perform open-set object recognition accurately, a key challenge is how to reduce the reliance on spurious discriminative features. In this paper, motivated by the observation that different large models pre-trained through different paradigms can possess very rich yet distinct implicit knowledge, we propose a novel framework named Large Model Collaboration (LMC) to tackle the above challenge by collaborating different off-the-shelf large models in a training-free manner. Moreover, we also equip the proposed framework with several novel designs to effectively extract implicit knowledge from large models. Extensive experiments demonstrate the efficacy of our proposed framework. Code is available \href{this https URL}{here}.
https://arxiv.org/abs/2309.12780
Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ LLMs to generate reasoning paths and plans, and utilize information retrieval (IR) to iteratively retrieve related knowledge, but these approaches have inherent flaws. On one hand, the information retriever is hindered by the low quality of the queries generated by the LLM. On the other hand, the LLM is easily misguided by irrelevant knowledge returned by IR. These inaccuracies accumulate through the iterative interaction between IR and LLM and severely degrade effectiveness in the end. To overcome the above barriers, in this paper we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), including an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning operates by masking the previous reasoning path and generated queries from the LLM, encouraging the LLM to generate its chain of thought from scratch in each iteration. This approach enables the LLM to break the shackles built by previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform the state of the art on most metrics (achieving a 10%-12% improvement in answer accuracy).
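The masking idea above can be sketched as a loop in which the LLM never sees its own earlier reasoning or queries, only the question and the retrieved evidence. The `toy_llm` and `toy_retriever` below are hypothetical stand-ins invented for illustration, not the paper's components:

```python
def furthest_reasoning(question, llm, retriever, max_iters=3):
    """Sketch of Furthest Reasoning's masking loop: each iteration the LLM
    receives only the question and accumulated evidence, so misleading
    earlier thoughts and queries cannot propagate."""
    evidence = []
    for _ in range(max_iters):
        # Prior chains of thought and queries are masked here.
        thought, query, answer = llm(question, evidence)
        if answer is not None:
            return answer
        evidence.extend(retriever(query))
    return None

# Hypothetical stubs standing in for a real LLM and retriever.
def toy_retriever(query):
    corpus = {"capital of France": ["Paris is the capital of France."]}
    return corpus.get(query, [])

def toy_llm(question, evidence):
    if any("Paris" in e for e in evidence):
        return "reasoned from evidence", None, "Paris"
    return "need more facts", "capital of France", None

print(furthest_reasoning("In which city is the Louvre?", toy_llm, toy_retriever))
```

The Plan Assessor would sit inside the loop, scoring several candidate (thought, query) plans and keeping the best; that trained component is omitted from this sketch.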
https://arxiv.org/abs/2309.12767
Self-supervised representation learning (SSRL) has improved performance on downstream phoneme recognition compared to supervised models. Training SSRL models requires a large amount of pre-training data, which poses a challenge for low-resource languages. A common approach is transferring knowledge from other languages. Instead, we propose to use audio augmentation to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition as the downstream task. We performed a systematic comparison of augmentation techniques, namely: pitch variation, noise addition, accented target-language speech, and other-language speech. We found that combined augmentation (noise/pitch) was the best strategy, outperforming accent and language knowledge transfer. We compared performance with various quantities and types of pre-training data, and examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
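Of the augmentations compared, noise addition is the simplest to sketch: mix white noise into a waveform at a chosen signal-to-noise ratio. The SNR handling below is a generic illustration, not the paper's exact recipe:

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise to `signal` (a list of floats) at `snr_db`.

    Noise power is derived from the signal power and the target SNR:
    SNR_dB = 10 * log10(P_signal / P_noise).
    """
    power = sum(s * s for s in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    scale = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, scale) for s in signal]

# A short 440 Hz tone at 16 kHz as a stand-in for real speech.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
noisy = add_noise(clean, snr_db=10, rng=rng)
print(len(noisy) == len(clean))
```

In an SSRL pre-training pipeline, such augmented copies would simply be appended to (or substituted for) the limited in-domain audio before training.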
https://arxiv.org/abs/2309.12763
Human knowledge is subject to uncertainties, imprecision, incompleteness and inconsistencies. Moreover, the meaning of many everyday terms is dependent on the context. That poses a huge challenge for the Semantic Web. This paper introduces work on an intuitive notation and model for defeasible reasoning with imperfect knowledge, and relates it to previous work on argumentation theory. PKN is to N3 as defeasible reasoning is to deductive logic. Further work is needed on an intuitive syntax for describing reasoning strategies and tactics in declarative terms, drawing upon the AIF ontology for inspiration. The paper closes with observations on symbolic approaches in the era of large language models.
https://arxiv.org/abs/2309.12731
Large language models (LLMs) have had a huge impact on society due to their impressive capabilities and vast knowledge of the world. Various applications and tools have been created that allow users to interact with these models in a black-box scenario. However, one limitation of this scenario is that users cannot modify the internal knowledge of the model, and the only way to add or modify internal knowledge is by explicitly mentioning it to the model during the current interaction. This process is known as in-context learning, and it refers to learning that is confined to the user's current session or context. In-context learning has significant applications, but also has limitations that are seldom studied. In this paper, we present a study that shows how the model can suffer from interference between pieces of information that continually flow in the context, causing it to forget previously learned knowledge and reducing the model's performance. Along with demonstrating the problem, we propose an evaluation benchmark based on the bAbI dataset.
https://arxiv.org/abs/2309.12727
Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, and various datasets have recently been constructed to facilitate the syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
https://arxiv.org/abs/2309.12676
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization results obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness the knowledge from the target domain and the results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on the development and evaluation sets, respectively, compared to the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.
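DOVER-LAP combines per-channel diarization outputs by weighted label voting with overlap handling. As a much-simplified, hedged stand-in for that fusion step, a per-frame majority vote across channels looks like:

```python
from collections import Counter

def majority_vote(channel_labels):
    """Fuse per-frame speaker labels from several channels by majority vote.

    `channel_labels` is a list of equal-length label sequences, one per
    channel. Real DOVER-LAP additionally weights channels and assigns
    multiple speakers to overlapped frames; this sketch does neither.
    """
    fused = []
    for frame in zip(*channel_labels):
        fused.append(Counter(frame).most_common(1)[0][0])
    return fused

channels = [
    ["spk1", "spk1", "spk2", "spk2"],
    ["spk1", "spk2", "spk2", "spk2"],
    ["spk1", "spk1", "spk2", "spk1"],
]
print(majority_vote(channels))
```

The fused labels are what the paper then feeds back as pseudo-labels to retrain EEND-VC per session.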
https://arxiv.org/abs/2309.12656
Contract review is an essential step in construction projects to prevent potential losses. However, current methods for reviewing construction contracts lack effectiveness and reliability, leading to time-consuming and error-prone processes. While large language models (LLMs) have shown promise in revolutionizing natural language processing (NLP) tasks, they struggle with domain-specific knowledge and specialized issues. This paper presents a novel approach that leverages LLMs with construction contract knowledge to emulate the process of contract review by human experts. Our tuning-free approach incorporates construction contract domain knowledge to enhance language models for identifying construction contract risks. The use of natural language when building the domain knowledge base facilitates practical implementation. We evaluated our method on real construction contracts and achieved solid performance. Additionally, we investigated how large language models employ logical thinking during the task, and provide insights and recommendations for future research.
https://arxiv.org/abs/2309.12626
To alleviate expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundance of unlabeled images to predict pixel-level label maps of the same size. Previous methods often adopt co-training using two convolutional networks with the same architecture but different initialization, which fails to capture sufficiently diverse features. This motivates us to use tri-training and develop a triple-view encoder that utilizes encoders with different architectures to derive diverse features, and to exploit knowledge distillation to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both encoder and decoder, resulting in redundant features that incur a large memory cost. This inspires us to devise a dual-frequency decoder that selects important features by projecting features from the spatial domain to the frequency domain, where a dual-frequency channel attention mechanism is introduced to model feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, including the triple-view encoder and the dual-frequency decoder. Extensive experiments were conducted on two benchmarks, \ie, Pascal VOC 2012 and Cityscapes, whose results verify the superiority of the proposed method with a good tradeoff between precision and inference speed.
https://arxiv.org/abs/2309.12557
Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
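The two loss designs can be sketched in plain Python with cosine distance: an absolute loss pulls each student image embedding toward its teacher text embedding, while a relative loss only matches the pattern of pairwise distances. This is a schematic reading of the abstract, not the paper's exact formulation:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def absolute_loss(student_imgs, teacher_txts):
    """Mean distance between each student image embedding and the teacher's
    text embedding for the corresponding description."""
    pairs = zip(student_imgs, teacher_txts)
    return sum(cosine_distance(s, t) for s, t in pairs) / len(student_imgs)

def relative_loss(student_imgs, teacher_txts):
    """Match pairwise distance patterns instead of the embeddings themselves."""
    n = len(student_imgs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            ds = cosine_distance(student_imgs[i], student_imgs[j])
            dt = cosine_distance(teacher_txts[i], teacher_txts[j])
            total += (ds - dt) ** 2
            count += 1
    return total / count

# Invented 2-d embeddings for illustration only.
students = [[1.0, 0.0], [0.0, 1.0]]
teachers = [[1.0, 0.1], [0.1, 1.0]]
print(absolute_loss(students, teachers), relative_loss(students, teachers))
```

Either term would be added to the student's usual task loss during training; the teacher (CLIP) stays frozen.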
https://arxiv.org/abs/2309.12530
Many mathematical models have been leveraged to design embeddings for representing Knowledge Graph (KG) entities and relations for link prediction and many downstream tasks. These mathematically-inspired models are not only highly scalable for inference in large KGs, but also have many explainable advantages in modeling different relation patterns that can be validated through both formal proofs and empirical results. In this paper, we provide a comprehensive overview of the current state of research in KG completion. In particular, we focus on two main branches of KG embedding (KGE) design: 1) distance-based methods and 2) semantic matching-based methods. We discover the connections between recently proposed models and present an underlying trend that might help researchers invent novel and more effective models. Next, we delve into CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations, respectively. They encompass a broad spectrum of techniques, including distance-based and semantic-based methods. We will also discuss an emerging approach for KG completion which leverages pre-trained language models (PLMs) and textual descriptions of entities and relations, and offer insights into the integration of KGE methods with PLMs for KG completion.
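A representative distance-based KGE scorer is TransE, which models a relation as a translation in embedding space and scores a triple (h, r, t) by how closely h + r lands on t. The 3-d embeddings below are invented purely for illustration:

```python
import math

def transe_score(h, r, t):
    """TransE plausibility: negative L2 norm of h + r - t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Invented toy embeddings for illustration only.
paris  = [0.9, 0.1, 0.0]
france = [0.8, 0.9, 0.0]
tokyo  = [0.0, 0.1, 0.9]
capital_of = [-0.1, 0.8, 0.0]  # chosen here as roughly france - paris

good = transe_score(paris, capital_of, france)   # plausible triple
bad  = transe_score(tokyo, capital_of, france)   # implausible triple
print(good > bad)
```

Semantic matching-based methods replace the distance with a similarity (e.g. a bilinear product of h, r, t); CompoundE generalizes the translation to compositions of affine operations.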
https://arxiv.org/abs/2309.12501
With more complex AI systems used by non-AI experts to complete daily tasks, there is an increasing effort to develop methods that produce explanations of AI decision making understandable by non-AI experts. Towards this effort, leveraging higher-level concepts and producing concept-based explanations have become a popular method. Most concept-based explanations have been developed for classification techniques, and we posit that the few existing methods for sequential decision making are limited in scope. In this work, we first contribute a set of desiderata for defining "concepts" in sequential decision making settings. Additionally, inspired by the Protege Effect, which states that explaining knowledge often reinforces one's own learning, we explore the utility of concept-based explanations providing a dual benefit: to the RL agent, by improving its learning rate, and to the end-user, by improving their understanding of agent decision making. To this end, we contribute a unified framework, State2Explanation (S2E), that involves learning a joint embedding model between state-action pairs and concept-based explanations, and leveraging such a learned model to both (1) inform reward shaping during an agent's training, and (2) provide explanations to end-users at deployment for improved task performance. Our experimental validations, in Connect 4 and Lunar Lander, demonstrate the success of S2E in providing a dual benefit, successfully informing reward shaping and improving agent learning rate, as well as significantly improving end-user task performance at deployment time.
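A standard way to inject such guidance into training without changing the optimal policy is potential-based reward shaping, r' = r + γΦ(s') − Φ(s) (Ng, Harada & Russell, 1999). The potential function below is a hypothetical stand-in for a signal derived from S2E's joint embedding, not the paper's actual model:

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).

    This form provably preserves the set of optimal policies, so the
    shaping signal can only speed up learning, not redirect it.
    """
    return gamma * potential(s_next) - potential(s) + r

# Hypothetical potential standing in for the embedding-derived signal:
# states whose concept-based explanation is "closer to the goal" score higher.
def toy_potential(state):
    return float(state)

print(shaped_reward(0.0, 1, 2, toy_potential))
```

Whether S2E uses exactly this potential-based form is not stated in the abstract; the sketch only shows how an explanation-derived score can be folded into the reward.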
https://arxiv.org/abs/2309.12482
Applying link prediction (LP) methods over knowledge graphs (KG) for tasks such as causal event prediction presents an exciting opportunity. However, typical LP models are ill-suited for this task as they are incapable of performing inductive link prediction for new, unseen event entities and they require retraining as knowledge is added or changed in the underlying KG. We introduce a case-based reasoning model, EvCBR, to predict properties about new consequent events based on similar cause-effect events present in the KG. EvCBR uses statistical measures to identify similar events and performs path-based predictions, requiring no training step. To generalize our methods beyond the domain of event prediction, we frame our task as a 2-hop LP task, where the first hop is a causal relation connecting a cause event to a new effect event and the second hop is a property about the new event which we wish to predict. The effectiveness of our method is demonstrated using a novel dataset of newsworthy events with causal relations curated from Wikidata, where EvCBR outperforms baselines including translational-distance-based, GNN-based, and rule-based LP models.
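The statistical-similarity step can be caricatured with Jaccard similarity over event properties: find past cause events most like the new cause, then predict the queried property of the new effect from their effects. The toy cases below are invented, not from the Wikidata dataset:

```python
def jaccard(a, b):
    """Jaccard similarity between two property sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def predict_property(new_cause, cases, prop, k=2):
    """Predict `prop` of the new effect by majority vote over the k past
    cause-effect cases whose cause is most similar to `new_cause`."""
    ranked = sorted(cases, key=lambda c: jaccard(new_cause, c["cause"]),
                    reverse=True)
    votes = [c["effect"].get(prop) for c in ranked[:k]]
    votes = [v for v in votes if v is not None]
    return max(set(votes), key=votes.count) if votes else None

# Invented toy cases: cause properties -> effect properties.
cases = [
    {"cause": {"earthquake", "coastal"}, "effect": {"type": "tsunami"}},
    {"cause": {"earthquake", "inland"},  "effect": {"type": "landslide"}},
    {"cause": {"storm", "coastal"},      "effect": {"type": "flood"}},
]
print(predict_property({"earthquake", "coastal"}, cases, "type", k=1))
```

EvCBR's actual predictions follow relation paths in the KG rather than copying property values directly, but the retrieve-similar-cases-then-transfer structure is the same.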
https://arxiv.org/abs/2309.12423
Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (\eg, LLaMA-2) are still far from satisfactory for solving mathematical problems due to the complex reasoning procedures. To bridge this gap, we propose \emph{MetaMath}, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called {MetaMathQA}. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (\ie, GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves $66.4\%$ on GSM8K and $19.4\%$ on MATH, exceeding the state-of-the-art models of the same size by $11.5\%$ and $8.7\%$. Particularly, {MetaMath-70B} achieves an accuracy of $82.3\%$ on {GSM8K}, slightly better than {GPT-3.5-Turbo}. We release the {MetaMathQA} dataset, the {MetaMath} models with different model sizes and the training code for public use.
https://arxiv.org/abs/2309.12284
Large language models (LLMs) have demonstrated dominant performance in many NLP tasks, especially on generative tasks. However, they often fall short in some information extraction tasks, particularly those requiring domain-specific knowledge, such as biomedical named entity recognition (NER). In this paper, inspired by chain-of-thought prompting, we leverage the LLM to solve biomedical NER step by step: we break the NER task down into entity span extraction and entity type determination. Additionally, for entity type determination, we inject entity knowledge to address the LLM's lack of domain knowledge when predicting entity categories. Experimental results show a significant improvement of our two-step BioNER approach over the previous few-shot LLM baseline. Additionally, the incorporation of external knowledge significantly enhances entity category determination performance.
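The two-step decomposition can be sketched with stub components; the span extractor and knowledge base below are hypothetical placeholders, not the paper's prompts or resources:

```python
# Step 1: extract candidate entity spans; Step 2: assign types, consulting
# injected domain knowledge. Both components here are toy stand-ins for
# what would be LLM prompts in the actual approach.

ENTITY_KNOWLEDGE = {          # hypothetical injected knowledge base
    "aspirin": "Chemical",
    "BRCA1": "Gene",
}

def extract_spans(sentence):
    """Toy span extractor: keeps tokens found in the knowledge base."""
    return [tok for tok in sentence.split() if tok in ENTITY_KNOWLEDGE]

def determine_types(spans):
    """Type each span, preferring injected knowledge over guessing."""
    return {span: ENTITY_KNOWLEDGE.get(span, "Unknown") for span in spans}

sentence = "BRCA1 mutations alter the response to aspirin"
spans = extract_spans(sentence)
print(determine_types(spans))
```

In the paper both steps are LLM calls; the knowledge is injected into the second prompt so the model need not rely on its own biomedical coverage.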
https://arxiv.org/abs/2309.12278
Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT-3.5 could generally expose fake news and provide desirable multi-perspective rationales, but still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis attributes such a gap to the LLM's inability to select and integrate rationales properly to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute for fine-tuned SLMs in fake news detection but can be a good advisor for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLM's rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.
https://arxiv.org/abs/2309.12247