Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. In this paper, we propose a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of sharing normalization statistics between the two token types, which may not be optimally aligned with their individual roles. We empirically show that with a separate normalization layer, the [CLS] embeddings better encode global contextual information and are distributed more uniformly in their anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement across the image, natural language, and graph domains.
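A minimal sketch (PyTorch; the module and names are ours, not the paper's) of what the change amounts to: route the [CLS] token and the content tokens through two separate LayerNorm instances rather than one shared layer. Since LayerNorm computes its statistics per token, the separation chiefly decouples the learned affine parameters of the two token types.

```python
import torch
import torch.nn as nn

class SeparateNorm(nn.Module):
    """Distinct LayerNorms for the [CLS] token and the remaining tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.cls_norm = nn.LayerNorm(dim)    # parameters for the [CLS] symbol
        self.token_norm = nn.LayerNorm(dim)  # parameters for content tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + num_tokens, dim), with [CLS] at position 0
        return torch.cat([self.cls_norm(x[:, :1]),
                          self.token_norm(x[:, 1:])], dim=1)
```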
https://arxiv.org/abs/2309.12931
High-quality text embedding is pivotal for semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge for existing text embedding models is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zones of the cosine function, which can impede gradients and hinder optimization. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate AnglE's ability to generate high-quality text embeddings and the usefulness of angle optimization in STS.
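A sketch of the core idea as we read it (not the paper's exact formulation): view each embedding as a complex vector, divide pairs element-wise, and use the resulting angle as the quantity to optimize, since the angle keeps informative gradients where the cosine saturates.

```python
import torch

def angle_difference(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Per-pair angle dissimilarity of embeddings viewed as complex vectors.

    An embedding of dimension 2d is read as d complex numbers: the first
    half is the real part, the second half the imaginary part.
    """
    a, b = torch.chunk(u, 2, dim=-1)  # u = a + bi
    c, d = torch.chunk(v, 2, dim=-1)  # v = c + di
    # u / v = ((ac + bd) + (bc - ad)i) / |v|^2; the angle ignores |v|^2
    re = a * c + b * d
    im = b * c - a * d
    return torch.atan2(im, re).abs().mean(dim=-1)
```

In training, this dissimilarity would feed a ranking objective so that pairs labeled as more similar receive smaller angle differences.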
https://arxiv.org/abs/2309.12871
The representations of neural networks are often compared to those of biological systems by performing regression between the neural network responses and those measured from biological systems. Many different state-of-the-art deep neural networks yield similar neural predictions, but it remains unclear how to differentiate among models that perform equally well at predicting neural responses. To gain insight into this, we use a recent theoretical framework that relates the generalization error from regression to the spectral bias of the model activations and the alignment of the neural responses onto the learnable subspace of the model. We extend this theory to the case of regression between model activations and neural responses, and define geometrical properties describing the error embedding geometry. We test a large number of deep neural networks that predict visual cortical activity and show that there are multiple types of geometries that result in low neural prediction error as measured via regression. The work demonstrates that carefully decomposing representational metrics can provide interpretability of how models are capturing neural activity and points the way towards improved models of neural activity.
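The regression setup referenced above is easy to picture; a toy sketch with synthetic stand-ins (actual pipelines would use real network activations and recorded neural responses):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 256))               # model activations
W = rng.standard_normal((256, 40))
Y = 0.1 * X @ W + rng.standard_normal((500, 40))  # "neural" responses

# Neural prediction error as typically measured: regularized regression
# from activations to responses, scored by held-out R^2.
scores = cross_val_score(Ridge(alpha=1.0), X, Y, cv=5, scoring="r2")
print(f"mean held-out R^2: {scores.mean():.3f}")

# The spectral quantities the theory relates this error to: eigenvalues of
# the activation covariance, and the responses' loading on each eigenmode.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
alignment = eigvecs.T @ X.T @ Y
```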
https://arxiv.org/abs/2309.12821
Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-NET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific, batch-normalized class-prototype alignment strategy to align both domains globally while ensuring class discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation.
https://arxiv.org/abs/2309.12814
This work provides an overview of the open-source multilingual tool StyloMetrix. It offers stylometric text representations that cover various aspects of grammar, syntax, and lexicon. StyloMetrix covers four languages: Polish as the primary language, English, Ukrainian, and Russian. The normalized output of each feature can serve as a fruitful source of input for machine learning models and a valuable addition to the embedding layer of any deep learning algorithm. We strive to provide a concise but exhaustive overview of the application of the StyloMetrix vectors and to explain the sets of developed linguistic features. The experiments have shown promising results in supervised content classification with simple algorithms such as the Random Forest Classifier, Voting Classifier, Logistic Regression, and others. The deep learning assessments have unveiled the usefulness of the StyloMetrix vectors in enhancing an embedding layer extracted from Transformer architectures. StyloMetrix has proven itself to be a formidable source of features for machine learning and deep learning algorithms across different classification tasks.
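A hedged sketch of the two usages, with hypothetical file names and shapes: one normalized StyloMetrix vector per document, used alone with a simple classifier or concatenated with a pooled Transformer embedding.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical inputs: StyloMetrix features and pooled Transformer
# embeddings, one row per document.
stylo = np.load("stylometrix_features.npy")    # (n_docs, n_stylo_features)
bert = np.load("transformer_embeddings.npy")   # (n_docs, hidden_size)
labels = np.load("labels.npy")                 # (n_docs,)

# Usage 1: the StyloMetrix vectors alone with a simple classifier.
clf_stylo = RandomForestClassifier(n_estimators=300).fit(stylo, labels)

# Usage 2: concatenated with the Transformer embedding layer.
X = np.concatenate([stylo, bert], axis=1)
clf_joint = RandomForestClassifier(n_estimators=300).fit(X, labels)
```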
https://arxiv.org/abs/2309.12810
This research introduces an enhanced version of the multi-objective speech assessment model, called MOSA-Net+, which leverages the acoustic features of large pre-trained weakly supervised models, namely Whisper, to create embedding features. The first part of this study investigates how the embedding features of Whisper and of two self-supervised learning (SSL) models correlate with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper for deploying a more robust speech assessment model. Third, the possibility of combining representations from Whisper and the SSL models while deploying MOSA-Net+ is analyzed. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than the SSL models' embedding features, contributing to the more accurate prediction performance achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and the SSL models leads to only marginal improvement. Compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023 and obtained the top-ranked performance.
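A small sketch of extracting such acoustic embeddings with the openai-whisper package (the checkpoint, file name, and mean pooling are illustrative; the paper's exact feature extraction may differ):

```python
import torch
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("utterance.wav"))
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)  # (1, n_mels, frames)

with torch.no_grad():
    feats = model.embed_audio(mel)   # encoder states: (1, frames, d_model)
embedding = feats.mean(dim=1)        # pooled feature for the assessment head
```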
https://arxiv.org/abs/2309.12766
Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). In this paper, we argue that when we are only interested in measuring semantic similarity, it is better to directly predict the similarity using a model fine-tuned for that task. Using a model fine-tuned on the STS-B task from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.
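In practice, this amounts to scoring sentence pairs with an STS-B-tuned regression model; a minimal sketch using a public STS-B cross-encoder (the paper's exact checkpoint may differ):

```python
from sentence_transformers import CrossEncoder

# A regression model fine-tuned on STS-B predicts similarity directly,
# instead of deriving it from n-gram overlap or embedding distances.
model = CrossEncoder("cross-encoder/stsb-roberta-large")
score = model.predict([("A man is playing a guitar.",
                        "Someone plays an instrument.")])[0]
print(score)  # normalized STS-B-style similarity in [0, 1]
```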
https://arxiv.org/abs/2309.12697
Recent advancements in Natural Language Processing (NLP) have highlighted the potential of sentence embeddings for measuring semantic similarity. Yet, their application to analyzing real-world dyadic interactions and predicting the affect of conversational participants remains largely uncharted. To bridge this gap, the present study utilizes verbal conversations of 50 married couples talking about conflicts and pleasant activities. The Transformer-based model all-MiniLM-L6-v2 was employed to obtain embeddings of the utterances from each speaker. The overall similarity of a conversation was then quantified by the average cosine similarity between the embeddings of adjacent utterances. Results showed that semantic similarity had a positive association with wives' affect during conflict (but not pleasant) conversations. Moreover, this association was not observed for husbands' affect regardless of conversation type. Two validation checks further supported the validity of the similarity measure and showed that the observed patterns were not mere artifacts of the data. The present study underscores the potency of sentence embeddings for understanding the association between interpersonal dynamics and individual affect, paving the way for innovative applications in affective and relationship sciences.
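The similarity measure itself is compact; a sketch with sentence-transformers (the example turns are invented):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def conversation_similarity(utterances: list[str]) -> float:
    """Mean cosine similarity between embeddings of adjacent utterances."""
    emb = model.encode(utterances, normalize_embeddings=True)
    # With unit-normalized embeddings, the dot product is the cosine.
    return float(np.sum(emb[:-1] * emb[1:], axis=1).mean())

turns = ["I felt ignored yesterday.",
         "I was just tired after work.",
         "You say that every time."]
print(conversation_similarity(turns))
```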
https://arxiv.org/abs/2309.12646
Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
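One plausible reading of the two loss designs, sketched in PyTorch (this is our paraphrase of "absolute" and "relative" distance, not the official implementation):

```python
import torch
import torch.nn.functional as F

def rise_regularizers(student_img: torch.Tensor, teacher_txt: torch.Tensor):
    """student_img: (B, D) student image embeddings for a batch.
    teacher_txt: (B, D) frozen CLIP-teacher text embeddings of the
    corresponding image descriptions."""
    s = F.normalize(student_img, dim=-1)
    t = F.normalize(teacher_txt, dim=-1)
    # Absolute distance: pull each student embedding toward its own
    # teacher text embedding.
    absolute = (1 - (s * t).sum(-1)).mean()
    # Relative distance: match the pairwise similarity structure of the
    # student space to that of the teacher space.
    relative = F.mse_loss(s @ s.T, t @ t.T)
    return absolute, relative
```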
https://arxiv.org/abs/2309.12530
Many mathematical models have been leveraged to design embeddings for representing Knowledge Graph (KG) entities and relations for link prediction and many downstream tasks. These mathematically-inspired models are not only highly scalable for inference in large KGs, but also have many explainable advantages in modeling different relation patterns, which can be validated through both formal proofs and empirical results. In this paper, we provide a comprehensive overview of the current state of research in KG completion. In particular, we focus on two main branches of KG embedding (KGE) design: 1) distance-based methods and 2) semantic matching-based methods. We uncover the connections between recently proposed models and present an underlying trend that might help researchers invent novel and more effective models. Next, we delve into CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations, respectively. They encompass a broad spectrum of techniques, including distance-based and semantic-based methods. We also discuss an emerging approach to KG completion that leverages pre-trained language models (PLMs) and textual descriptions of entities and relations, and offer insights into the integration of KGE methods with PLMs for KG completion.
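To make the distance-based branch concrete, a minimal TransE-style scorer (the canonical distance-based model; semantic matching models such as DistMult instead score triples via a multiplicative form):

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Distance-based KGE: score(h, r, t) = -||h + r - t||_1."""
    def __init__(self, n_entities: int, n_relations: int, dim: int = 200):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h: torch.Tensor, r: torch.Tensor, t: torch.Tensor):
        # Higher score = more plausible triple; trained with a ranking loss
        # against corrupted (negative) triples.
        return -(self.ent(h) + self.rel(r) - self.ent(t)).norm(p=1, dim=-1)
```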
https://arxiv.org/abs/2309.12501
We study the effect of tokenization on gender bias in machine translation, an aspect that has been largely overlooked in previous works. Specifically, we focus on the interactions between the frequency of gendered profession names in training data, their representation in the subword tokenizer's vocabulary, and gender bias. We observe that female and non-stereotypical gender inflections of profession names (e.g., Spanish "doctora" for "female doctor") tend to be split into multiple subword tokens. Our results indicate that the imbalance of gender forms in the model's training corpus is a major factor contributing to gender bias and has a greater impact than subword splitting. We show that analyzing subword splits provides good estimates of gender-form imbalance in the training data and can be used even when the corpus is not publicly available. We also demonstrate that fine-tuning just the token embedding layer can decrease the gap in gender prediction accuracy between female and male forms without impairing the translation quality.
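The subword analysis is straightforward to reproduce in spirit; a sketch with a public translation tokenizer (the checkpoint is an example, not necessarily the one studied):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")

# Compare how male, female, and non-stereotypical profession forms split.
for word in ["doctor", "doctora", "enfermero", "enfermera"]:
    pieces = tok.tokenize(word)
    print(f"{word:>10} -> {pieces} ({len(pieces)} subwords)")
```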
https://arxiv.org/abs/2309.12491
With more complex AI systems used by non-AI experts to complete daily tasks, there is an increasing effort to develop methods that produce explanations of AI decision making that are understandable by non-AI experts. Towards this effort, leveraging higher-level concepts and producing concept-based explanations have become a popular method. Most concept-based explanations have been developed for classification techniques, and we posit that the few existing methods for sequential decision making are limited in scope. In this work, we first contribute a set of desiderata for defining "concepts" in sequential decision making settings. Additionally, inspired by the Protégé Effect, which states that explaining knowledge often reinforces one's own learning, we explore the utility of concept-based explanations providing a dual benefit: to the RL agent, by improving its learning rate, and to the end-user, by improving their understanding of agent decision making. To this end, we contribute a unified framework, State2Explanation (S2E), that involves learning a joint embedding model between state-action pairs and concept-based explanations, and leveraging such a learned model to both (1) inform reward shaping during an agent's training, and (2) provide explanations to end-users at deployment for improved task performance. Our experimental validations in Connect 4 and Lunar Lander demonstrate the success of S2E in providing a dual benefit, successfully informing reward shaping and improving agent learning rate, as well as significantly improving end-user task performance at deployment time.
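A heavily hedged sketch of how a learned joint embedding could inform reward shaping (use (1) above); the names and the shaping rule are our illustrative assumptions, not the paper's exact mechanism:

```python
import torch
import torch.nn.functional as F

def shaped_reward(base_reward: float,
                  state_action_emb: torch.Tensor,
                  concept_emb: torch.Tensor,
                  bonus_scale: float = 0.1) -> float:
    """Add a bonus when the state-action pair's embedding aligns with the
    embedding of a desirable concept-based explanation."""
    sim = F.cosine_similarity(state_action_emb, concept_emb, dim=-1)
    return base_reward + bonus_scale * float(sim)
```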
https://arxiv.org/abs/2309.12482
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 incurs 16x the computational cost in self-attention layers compared to a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization layers. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
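A sketch of the shift operation as we understand it from the abstract (shapes and grouping details are our assumptions): half of the attention heads are shifted by half a group, after which standard attention runs independently within each group, letting information flow across group boundaries.

```python
import torch

def shift_short_attention_prep(qkv: torch.Tensor, group_size: int):
    """qkv: (batch, seq_len, 3, num_heads, head_dim).

    Returns qkv folded into groups of length `group_size`, with the second
    half of the heads rolled by half a group along the sequence axis.
    """
    B, S, three, H, D = qkv.shape
    h1, h2 = qkv[..., :H // 2, :], qkv[..., H // 2:, :]
    h2 = h2.roll(-group_size // 2, dims=1)     # shift half the heads
    qkv = torch.cat([h1, h2], dim=3)
    # fold groups into the batch; attention now runs per group
    return qkv.reshape(B * (S // group_size), group_size, three, H, D)
```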
https://arxiv.org/abs/2309.12307
We tackle the problem of robust novelty detection, where we aim to detect novelties in terms of semantic content while being invariant to changes in other, irrelevant factors. Specifically, we operate in a setup with multiple environments, where we determine the set of features that are associated more with the environments than with the content relevant to the task. Thus, we propose a method that starts from a pretrained embedding and a multi-environment setup and ranks the features based on their environment focus. First, we compute a per-feature score based on the variance of the feature's distribution across environments. Next, we show that by dropping the highly scored features, we remove spurious correlations and improve the overall performance by up to 6%, in both covariate and sub-population shift cases, on both a real and a synthetic benchmark that we introduce for this task.
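A sketch of the ranking-and-dropping step (using the variance of per-environment feature means as the dispersion measure, which is one plausible instantiation of the score described above):

```python
import numpy as np

def env_focus_scores(features_per_env: list[np.ndarray]) -> np.ndarray:
    """Score each embedding dimension by how much its distribution shifts
    across environments; each array is (n_samples_in_env, n_features)."""
    env_means = np.stack([f.mean(axis=0) for f in features_per_env])
    return env_means.var(axis=0)   # high = environment-focused feature

def drop_top_features(x: np.ndarray, scores: np.ndarray, k: int) -> np.ndarray:
    """Remove the k most environment-focused dimensions before novelty
    detection runs on the remaining, content-focused features."""
    keep = np.sort(np.argsort(scores)[:-k])   # keep original column order
    return x[:, keep]
```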
https://arxiv.org/abs/2309.12301
In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio clips and captions. Motivated by recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach that trains an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge the gap during the training and inference stages. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to ~$83\%$ compared to fully supervised approaches trained with paired target data.
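A sketch of the training/inference asymmetry; the decoder interface is hypothetical, and Gaussian noise injection is shown as one common gap-bridging strategy in this line of work, not necessarily the paper's:

```python
import torch

def train_step(decoder, clap_text_emb, caption_tokens, noise_std=0.015):
    """Teach the decoder to reconstruct the caption from its own CLAP text
    embedding; added noise makes it tolerant to the audio-text gap."""
    z = clap_text_emb + noise_std * torch.randn_like(clap_text_emb)
    return decoder.loss(z, caption_tokens)   # hypothetical decoder API

@torch.no_grad()
def caption_audio(decoder, clap_audio_emb):
    """At inference, swap in the CLAP audio embedding and decode."""
    return decoder.generate(clap_audio_emb)  # hypothetical decoder API
```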
https://arxiv.org/abs/2309.12242
A range of applications of multi-modal music information retrieval is centred around the problem of connecting large collections of sheet music (images) to corresponding audio recordings, that is, identifying pairs of audio and score excerpts that refer to the same musical content. One of the typical and most recent approaches to this task employs cross-modal deep learning architectures to learn joint embedding spaces that link the two distinct modalities - audio and sheet music images. While there has been steady improvement on this front over the past years, a number of open problems still prevent large-scale employment of this methodology. In this article we attempt to provide an insightful examination of the current developments on audio-sheet music retrieval via deep learning methods. We first identify a set of main challenges on the road towards robust and large-scale cross-modal music retrieval in real scenarios. We then highlight the steps we have taken so far to address some of these challenges, documenting step-by-step improvement along several dimensions. We conclude by analysing the remaining challenges and present ideas for solving these, in order to pave the way to a unified and robust methodology for cross-modal music retrieval.
https://arxiv.org/abs/2309.12158
Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches to this task is to learn, via deep neural networks, a cross-modal embedding space that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning, by exposing a network to a large amount of real music data as a pre-training step, contrasting randomly augmented views of snippets from both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct more experiments on several retrieval configurations. In this task, we observe that retrieval quality improves from 30% up to 100% when real music data is present. We then conclude by arguing for the potential of self-supervised contrastive learning for alleviating the scarcity of annotated data in multi-modal music retrieval models.
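A standard formulation of such a contrastive pre-training objective, sketched over a batch of paired snippet embeddings (the paper's exact loss and augmentations may differ):

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(z_audio, z_sheet, tau: float = 0.07):
    """z_audio, z_sheet: (B, D) embeddings of augmented audio and sheet-image
    views; matching indices are positives, all others in-batch negatives."""
    za = F.normalize(z_audio, dim=-1)
    zs = F.normalize(z_sheet, dim=-1)
    logits = za @ zs.T / tau
    targets = torch.arange(len(za), device=za.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```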
https://arxiv.org/abs/2309.12134
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
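A toy sketch of the multiscale branch idea (band design, channel counts, and strides are illustrative; the actual MSAE derives its bands from the Constant-Q transform):

```python
import torch
import torch.nn as nn

class ToyMultiscaleEncoder(nn.Module):
    """Parallel branches that filter the waveform and downsample at
    different rates, producing one embedding sequence per scale."""
    def __init__(self, branch_specs=((64, 32), (128, 64), (256, 128))):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, 32, kernel_size=k, stride=s, padding=k // 2)
            for k, s in branch_specs)   # (kernel, stride) per band/rate

    def forward(self, wav: torch.Tensor):  # wav: (batch, 1, n_samples)
        # Every operator is differentiable, so the encoder can sit inside
        # an end-to-end mask-based enhancement network.
        return [torch.relu(b(wav)) for b in self.branches]
```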
https://arxiv.org/abs/2309.12121
Many applications of cross-modal music retrieval are related to connecting sheet music images to audio recordings. A typical and recent approach to this is to learn, via deep neural networks, a joint embedding space that correlates short fixed-size snippets of audio and sheet music by means of an appropriate similarity structure. However, two challenges that arise out of this strategy are the requirement of strongly aligned data to train the networks, and the inherent discrepancies of musical content between audio and sheet music snippets caused by local and global tempo differences. In this paper, we address these two shortcomings by designing a cross-modal recurrent network that learns joint embeddings that can summarize longer passages of corresponding audio and sheet music. The benefits of our method are that it only requires weakly aligned audio-sheet music pairs, as well as that the recurrent network handles the non-linearities caused by tempo variations between audio and sheet music. We conduct a number of experiments on synthetic and real piano data and scores, showing that our proposed recurrent method leads to more accurate retrieval in all possible configurations.
https://arxiv.org/abs/2309.12111
Anticancer peptides (ACPs) are a group of peptides that exhibit antineoplastic properties. The utilization of ACPs in cancer prevention can present a viable substitute for conventional cancer therapeutics, as they possess a higher degree of selectivity and safety. Recent scientific advancements have generated interest in peptide-based therapies, which offer the advantage of efficiently treating intended cells without negatively impacting normal cells. However, as the number of peptide sequences continues to increase rapidly, developing a reliable and precise prediction model becomes a challenging task. In this work, our motivation is to advance an efficient model for categorizing anticancer peptides by consolidating word embedding and deep learning models. First, Word2Vec and FastText are evaluated as word embedding techniques for representing peptide sequences. Then, the outputs of the word embedding models are fed into the deep learning approaches CNN, LSTM, and BiLSTM. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on widely-used datasets in the literature, ACPs250 and Independent. The experimental results show that the proposed model enhances classification accuracy compared to state-of-the-art studies. The proposed combination, FastText+BiLSTM, exhibits 92.50% accuracy on the ACPs250 dataset and 96.15% accuracy on the Independent dataset, thereby establishing a new state-of-the-art.
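A compact sketch of the best-performing combination (residue-level tokenization, vector sizes, and mean pooling are our illustrative choices):

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import FastText

# FastText embeddings trained on peptide sequences, one token per residue.
seqs = [list("GLFDIVKKVVGALGSL"), list("FLPLIGRVLSGIL")]  # toy examples
ft = FastText(sentences=seqs, vector_size=100, window=5,
              min_count=1, epochs=10)

class PeptideBiLSTM(nn.Module):
    """BiLSTM classifier over per-residue embeddings."""
    def __init__(self, emb_dim: int = 100, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # ACP vs. non-ACP logit

    def forward(self, x):                      # x: (batch, seq_len, emb_dim)
        out, _ = self.lstm(x)
        return self.head(out.mean(dim=1))      # mean-pool over residues

seq = "GLFDIVKKVVGALGSL"
x = torch.from_numpy(np.array([[ft.wv[r] for r in seq]]))  # (1, len, 100)
print(PeptideBiLSTM()(x).shape)  # (1, 1)
```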
https://arxiv.org/abs/2309.12058