Fine-tuning large-scale Transformers has led to an explosion of AI applications across Natural Language Processing and Computer Vision tasks. However, fine-tuning all pre-trained model parameters becomes impractical as the model size and number of tasks increase. Parameter-efficient transfer learning (PETL) methods aim to address these challenges. While effective in reducing the number of trainable parameters, PETL methods still require significant energy and computational resources to fine-tune. In this paper, we introduce REcurrent ADaption (READ) -- a lightweight and memory-efficient fine-tuning method -- to overcome the limitations of current PETL approaches. Specifically, READ inserts a small RNN alongside the backbone model so that the model does not have to back-propagate through the large backbone network. Through comprehensive empirical evaluation on the GLUE benchmark, we demonstrate that READ achieves a 56% reduction in training memory consumption and an 84% reduction in GPU energy usage while retaining high model quality compared to full-tuning. Additionally, the model size of READ does not grow with the backbone model size, making it a highly scalable solution for fine-tuning large Transformers.
https://arxiv.org/abs/2305.15348
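READ's central idea, as described above, is to attach a small recurrent side network that reads the frozen backbone's intermediate states, so gradients never flow through the large backbone. A minimal sketch of that pattern in PyTorch (module names, the GRU choice, and the per-layer pooling are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class READStyleSideNet(nn.Module):
    """Sketch: a small RNN reads the frozen backbone's per-layer hidden states
    and produces a task-specific correction, so gradients never flow through
    the large backbone."""

    def __init__(self, backbone_layers, hidden_dim, side_dim=64):
        super().__init__()
        self.backbone_layers = backbone_layers       # frozen nn.ModuleList
        for p in self.backbone_layers.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(hidden_dim, side_dim)  # project each layer state
        self.rnn = nn.GRU(side_dim, side_dim, batch_first=True)
        self.up = nn.Linear(side_dim, hidden_dim)    # map back to backbone width

    def forward(self, x):                            # x: (batch, seq, hidden_dim)
        layer_states = []
        with torch.no_grad():                        # forward-only through backbone
            h = x
            for layer in self.backbone_layers:
                h = layer(h)
                layer_states.append(h.mean(dim=1))   # (batch, hidden_dim) summary
        # The side RNN runs over the depth dimension: one step per backbone layer.
        states = torch.stack(layer_states, dim=1)    # (batch, n_layers, hidden_dim)
        out, _ = self.rnn(self.down(states))
        correction = self.up(out[:, -1])             # use the last recurrent state
        return h[:, 0] + correction                  # corrected [CLS]-like output


# Usage with a toy frozen "backbone" of two feed-forward layers.
layers = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(2)])
model = READStyleSideNet(layers, hidden_dim=32)
print(model(torch.randn(4, 10, 32)).shape)           # torch.Size([4, 32])
```

Because only the side network's parameters receive gradients, the optimizer states and activation memory scale with the small recurrent module rather than with the backbone.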
Many recent developments in large language models focus on prompting them to perform specific tasks. One effective prompting method is in-context learning, where the model performs a (possibly new) generation/prediction task given one (or more) examples. Past work has shown that the choice of examples can make a large impact on task performance. However, finding good examples is not straightforward since the definition of a representative group of examples can vary greatly depending on the task. While there are many existing methods for selecting in-context examples, they generally score examples independently, ignoring the dependency between them and the order in which they are provided to the large language model. In this work, we propose Retrieval for In-Context Learning (RetICL), a learnable method for modeling and optimally selecting examples sequentially for in-context learning. We frame the problem of sequential example selection as a Markov decision process, design an example retriever model using an LSTM, and train it using proximal policy optimization (PPO). We validate RetICL on math problem solving datasets and show that it outperforms both heuristic and learnable baselines, and achieves state-of-the-art accuracy on the TabMWP dataset. We also use case studies to show that RetICL implicitly learns representations of math problem solving strategies.
https://arxiv.org/abs/2305.14502
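Framing example selection as an MDP means the examples chosen so far form the state and the next example is an action sampled from a policy. The sketch below illustrates an LSTM-based retriever of that shape (the bilinear scoring head and dimensions are illustrative, not the released RetICL code); its sampled actions would be trained with PPO against a downstream task-accuracy reward.

```python
import torch
import torch.nn as nn

class LSTMExampleRetriever(nn.Module):
    """Sketch: encode the problem plus the examples chosen so far with an LSTM,
    then score candidate examples for the next selection step."""

    def __init__(self, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bilinear = nn.Bilinear(hidden_dim, emb_dim, 1)  # state x candidate score

    def forward(self, problem_emb, chosen_embs, candidate_embs):
        # problem_emb: (1, emb); chosen_embs: (k, emb); candidate_embs: (n, emb)
        seq = torch.cat([problem_emb, chosen_embs], dim=0).unsqueeze(0)  # (1, k+1, emb)
        _, (h, _) = self.lstm(seq)
        state = h[-1]                                        # (1, hidden) MDP state
        scores = self.bilinear(state.expand(candidate_embs.size(0), -1),
                               candidate_embs).squeeze(-1)
        return torch.distributions.Categorical(logits=scores)  # policy over candidates


retriever = LSTMExampleRetriever()
policy = retriever(torch.randn(1, 128), torch.randn(2, 128), torch.randn(5, 128))
action = policy.sample()             # index of the next in-context example
# In training, policy.log_prob(action) would feed a PPO objective whose reward
# is based on whether the LLM solves the problem with the selected examples.
```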
Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Deep Learning (DL) has improved the performance of SER models by increasing model complexity. However, designing DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS) allows automatic search for an optimum DL model. In particular, Differentiable Architecture Search (DARTS) is an efficient method of using NAS to search for optimised models. In this paper, we propose DARTS for a joint CNN and LSTM architecture for improving SER performance. Our choice of the CNN-LSTM coupling is inspired by results showing that similar models offer improved performance. While SER researchers have considered CNNs and RNNs separately, the viability of using DARTS jointly for a CNN and LSTM still needs exploration. Experimenting with the IEMOCAP dataset, we demonstrate that our approach outperforms the best-reported results using DARTS for SER.
https://arxiv.org/abs/2305.14402
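DARTS relaxes the discrete choice of operation on each edge into a softmax-weighted mixture whose weights (the architecture parameters) are optimised by gradient descent alongside the network weights. A stripped-down sketch of that mixed operation feeding an LSTM, roughly in the spirit of a searched CNN-LSTM SER model (the candidate-operation list and sizes are illustrative, not the paper's search space):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style mixed operation: output = sum_i softmax(alpha)_i * op_i(x)."""

    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),                       # skip-connection candidate
        ])
        # Architecture parameters, optimised on validation data in DARTS.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):                        # x: (batch, channels, time)
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class SearchedSER(nn.Module):
    """Tiny CNN+LSTM stack whose convolutional block is searched with MixedOp."""

    def __init__(self, channels=40, hidden=64, n_emotions=4):
        super().__init__()
        self.mixed = MixedOp(channels)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, feats):                    # feats: (batch, channels, time)
        x = self.mixed(feats).transpose(1, 2)    # (batch, time, channels)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])


print(SearchedSER()(torch.randn(2, 40, 100)).shape)   # torch.Size([2, 4])
```

After the search, DARTS keeps only the operation with the largest alpha on each edge, yielding a discrete architecture to retrain.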
We develop an automated video colorization framework that minimizes the flickering of colors across frames. If we apply image colorization techniques to successive frames of a video, they treat each frame as a separate colorization task. Thus, they do not necessarily maintain the colors of a scene consistently across subsequent frames. The proposed solution includes a novel deep recurrent encoder-decoder architecture which is capable of maintaining temporal and contextual coherence between consecutive frames of a video. We use a high-level semantic feature extractor to automatically identify the context of a scenario including objects, with a custom fusion layer that combines the spatial and temporal features of a frame sequence. We present experimental results, qualitatively showing that recurrent neural networks can be successfully used to improve color consistency in video colorization.
https://arxiv.org/abs/2305.13704
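The temporal-coherence idea is that the decoder conditions not only on the current frame's features but also on a recurrent state carried over from earlier frames, so colors persist across the sequence. A minimal sketch of such a recurrent encode-fuse-decode loop (layer sizes and the fusion layer are placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn

class RecurrentColorizer(nn.Module):
    """Sketch: encode each grayscale frame, fuse it with a hidden state carried
    from earlier frames, and decode the two chrominance channels."""

    def __init__(self, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(feat * 2, feat, 3, padding=1)   # spatial + temporal
        self.decoder = nn.Conv2d(feat, 2, 3, padding=1)       # predict a/b channels

    def forward(self, frames):                   # frames: (batch, T, 1, H, W)
        b, t, _, h, w = frames.shape
        state = torch.zeros(b, self.fuse.out_channels, h, w, device=frames.device)
        colors = []
        for i in range(t):
            feat = self.encoder(frames[:, i])
            state = torch.tanh(self.fuse(torch.cat([feat, state], dim=1)))
            colors.append(self.decoder(state))
        return torch.stack(colors, dim=1)        # (batch, T, 2, H, W)


print(RecurrentColorizer()(torch.randn(1, 4, 1, 64, 64)).shape)
```

Because the state is reused across timesteps, a color decision made for an object in one frame biases the prediction for the same region in the next frame, which is what suppresses flicker.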
Human activity recognition (HAR) is one of the core research themes in ubiquitous and wearable computing. With the shift to deep learning (DL) based analysis approaches, it has become possible to extract high-level features and perform classification in an end-to-end manner. Despite their promising overall capabilities, DL-based HAR may suffer from overfitting due to the notoriously small, often inadequate, amounts of labeled sample data that are available for typical HAR applications. In response to such challenges, we propose ConvBoost -- a novel, three-layer, structured model architecture and boosting framework for convolutional network based HAR. Our framework generates additional training data from three different perspectives for improved HAR, aiming to alleviate the shortness of labeled training data in the field. Specifically, with the introduction of three conceptual layers -- Sampling Layer, Data Augmentation Layer, and Resilient Layer -- we develop three "boosters" -- R-Frame, Mix-up, and C-Drop -- to enrich the per-epoch training data by dense-sampling, synthesizing, and simulating, respectively. These new conceptual layers and boosters, which are universally applicable to any kind of convolutional network, have been designed based on the characteristics of the sensor data and the concept of frame-wise HAR. In our experimental evaluation on three standard benchmarks (Opportunity, PAMAP2, GOTOV), we demonstrate the effectiveness of our ConvBoost framework for HAR applications based on variants of convolutional networks: vanilla CNN, ConvLSTM, and Attention Models. We achieved substantial performance gains for all of them, which suggests that the proposed approach is generic and can serve as a practical solution for boosting the performance of existing ConvNet-based HAR models. This is an open-source project, and the code can be found at this https URL
https://arxiv.org/abs/2305.13541
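The three boosters act as per-epoch training-data generators for windowed sensor streams. A hedged sketch of the dense-sampling (R-Frame), mix-up, and channel-dropping (C-Drop) ideas (simplified; window size, step, and rates are placeholders rather than the released ConvBoost settings):

```python
import torch

def r_frame(stream, win=128, step=16):
    """Dense sampling: slide a window with a small step so each epoch sees many
    overlapping frames cut from the same labelled stream."""
    return [stream[:, i:i + win] for i in range(0, stream.shape[1] - win + 1, step)]

def mix_up(x1, x2, y1, y2, alpha=0.2):
    """Mix-up: synthesize a new training frame as a convex combination of two."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def c_drop(frame, p=0.2):
    """Channel drop: simulate missing or faulty sensor channels by zeroing some."""
    mask = (torch.rand(frame.shape[0], 1) > p).float()
    return frame * mask


stream = torch.randn(9, 1024)               # 9 sensor channels, 1024 timesteps
frames = r_frame(stream)
x, y = mix_up(frames[0], frames[1],
              torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]))
x = c_drop(x)
print(len(frames), x.shape)                 # 57 torch.Size([9, 128])
```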
The fixed-size context of the Transformer makes GPT models incapable of generating arbitrarily long text. In this paper, we introduce RecurrentGPT, a language-based simulacrum of the recurrence mechanism in RNNs. RecurrentGPT is built upon a large language model (LLM) such as ChatGPT and uses natural language to simulate the Long Short-Term Memory mechanism in an LSTM. At each timestep, RecurrentGPT generates a paragraph of text and updates its language-based long-term and short-term memory, stored on the hard drive and in the prompt, respectively. This recurrence mechanism enables RecurrentGPT to generate texts of arbitrary length without forgetting. Since human users can easily observe and edit the natural language memories, RecurrentGPT is interpretable and enables interactive generation of long text. RecurrentGPT is an initial step towards next-generation computer-assisted writing systems beyond local editing suggestions. In addition to producing AI-generated content (AIGC), we also demonstrate the possibility of using RecurrentGPT as an interactive fiction that directly interacts with consumers. We call this usage of generative models "AI As Contents" (AIAC), which we believe is the next form of conventional AIGC. We further demonstrate the possibility of using RecurrentGPT to create personalized interactive fiction that directly interacts with readers instead of interacting with writers. More broadly, RecurrentGPT demonstrates the utility of borrowing ideas from popular model designs in cognitive science and deep learning for prompting LLMs. Our code is available at this https URL and an online demo is available at this https URL.
https://arxiv.org/abs/2305.13304
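Because the recurrence lives entirely in prompt space, each step asks the LLM for the next paragraph plus an updated summary, and that summary is persisted to disk and fed back into the next prompt. A schematic of the loop (the `generate` function is a placeholder for whatever chat-LLM API is used; the prompt format and memory tag are assumptions, not the authors' code):

```python
from pathlib import Path

MEMORY_FILE = Path("memory.txt")             # language-based long-term memory on disk

def generate(prompt: str) -> str:
    """Placeholder for a call to a chat LLM (e.g. an API client); assumed here."""
    raise NotImplementedError

def recurrent_step(instruction: str) -> str:
    memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    prompt = (
        f"Long-term memory (summary of the story so far):\n{memory}\n\n"
        f"Instruction: {instruction}\n"
        "Write the next paragraph, then an updated summary after the tag <MEMORY>."
    )
    response = generate(prompt)
    paragraph, _, new_memory = response.partition("<MEMORY>")
    MEMORY_FILE.write_text(new_memory.strip())   # update the 'cell state'
    return paragraph.strip()

# for _ in range(n_steps):
#     print(recurrent_step("Continue the detective story."))
```

The disk-resident summary plays the role of the LSTM cell state, while the short-term memory is simply whatever recent text is kept verbatim in the prompt.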
Motivation: Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. Results: To address this problem, we systematically study the effectiveness of partial annotation learning methods for biomedical entity recognition over different simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We harmonize 15 biomedical NER corpora encompassing five entity types to serve as a gold standard and compare against two commonly used partial annotation learning models, BiLSTM-Partial-CRF and EER-PubMedBERT, and the state-of-the-art full annotation learning BioNER model, the PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms the alternatives and, specifically, the PubMedBERT tagger by 38% in F1-score under high missing entity rates. The recall of entity mentions in our model is also competitive with the upper bound on the fully annotated dataset.
https://arxiv.org/abs/2305.13120
To ensure safe autonomous driving in urban environments with complex vehicle-pedestrian interactions, it is critical for Autonomous Vehicles (AVs) to have the ability to predict pedestrians' short-term and immediate actions in real-time. In recent years, various methods have been developed to estimate pedestrian behaviors in autonomous driving scenarios, but there is a lack of clear definitions of pedestrian behaviors. In this work, the literature gaps are investigated and a taxonomy is presented for pedestrian behavior characterization. Further, a novel multi-task sequence-to-sequence Transformer encoder-decoder (TF-ed) architecture is proposed for pedestrian action and trajectory prediction using only ego-vehicle camera observations as inputs. The proposed approach is compared against an existing LSTM encoder-decoder (LSTM-ed) architecture for action and trajectory prediction. The performance of both models is evaluated on the publicly available Joint Attention Autonomous Driving (JAAD) dataset, CARLA simulation data, as well as real-time self-driving shuttle data collected on a university campus. Evaluation results illustrate that the proposed method reaches an accuracy of 81% on the action prediction task on the JAAD testing data and outperforms the LSTM-ed by 7.4%, while the LSTM counterpart performs much better on the trajectory prediction task for a prediction sequence length of 25 frames.
https://arxiv.org/abs/2305.13051
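A multi-task sequence-to-sequence model of this kind typically shares one Transformer encoder-decoder trunk and attaches separate heads for the discrete action and the continuous trajectory. A hedged sketch with made-up input features and dimensions (not the paper's exact TF-ed configuration):

```python
import torch
import torch.nn as nn

class TFEncDec(nn.Module):
    """Sketch: shared Transformer encoder-decoder with an action-classification
    head and a trajectory-regression head (dimensions are illustrative)."""

    def __init__(self, d_model=64, n_actions=2, horizon=25):
        super().__init__()
        self.in_proj = nn.Linear(4, d_model)         # e.g. past bounding boxes (x, y, w, h)
        self.tf = nn.Transformer(d_model=d_model, nhead=4,
                                 num_encoder_layers=2, num_decoder_layers=2,
                                 batch_first=True)
        self.query = nn.Parameter(torch.randn(horizon, d_model))  # future-step queries
        self.action_head = nn.Linear(d_model, n_actions)          # e.g. cross / not cross
        self.traj_head = nn.Linear(d_model, 4)                    # future bounding box

    def forward(self, past):                         # past: (batch, T_obs, 4)
        src = self.in_proj(past)
        tgt = self.query.unsqueeze(0).expand(past.size(0), -1, -1)
        dec = self.tf(src, tgt)                      # (batch, horizon, d_model)
        action_logits = self.action_head(dec.mean(dim=1))
        trajectory = self.traj_head(dec)
        return action_logits, trajectory


a, t = TFEncDec()(torch.randn(2, 15, 4))
print(a.shape, t.shape)      # torch.Size([2, 2]) torch.Size([2, 25, 4])
```

The two heads would be trained jointly, e.g. with a cross-entropy loss on the action logits plus a regression loss on the predicted trajectory.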
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference, leading to the first non-transformer architecture to be scaled to tens of billions of parameters. Our experiments reveal that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.
https://arxiv.org/abs/2305.13048
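The property that makes such models attractive is that a linear (kernelized) attention can be computed either in parallel matrix form for training or as a recurrence with constant per-token state for inference. The sketch below demonstrates that equivalence for a generic linear attention; it is a deliberate simplification and not RWKV's actual time-mix/WKV formulation:

```python
import torch
import torch.nn.functional as F

def linear_attention_recurrent(q, k, v):
    """Linear attention computed as a recurrence: a running state S and a
    normalizer z are updated once per token, giving constant memory at
    inference time. (A simplification in the spirit of RWKV.)"""
    phi = lambda x: F.elu(x) + 1                  # positive feature map
    d = q.shape[1]
    S, z = torch.zeros(d, d), torch.zeros(d)
    out = []
    for t in range(q.shape[0]):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        out.append(phi(q[t]) @ S / (phi(q[t]) @ z + 1e-6))
    return torch.stack(out)

def linear_attention_parallel(q, k, v):
    """The same computation in causal matrix form, as used during training."""
    qf, kf = F.elu(q) + 1, F.elu(k) + 1
    scores = torch.tril(qf @ kf.T)                # kernelized causal attention scores
    return scores @ v / (scores.sum(dim=1, keepdim=True) + 1e-6)


q, k, v = (torch.randn(8, 16) for _ in range(3))
print(torch.allclose(linear_attention_recurrent(q, k, v),
                     linear_attention_parallel(q, k, v), atol=1e-5))   # True
```

The parallel form is what makes training GPU-friendly; the recurrent form is what keeps per-token inference cost and memory constant regardless of context length.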
Applications of deep learning to financial market prediction have attracted huge attention from investors and researchers. In particular, for intra-day prediction at the minute scale, the dramatic fluctuations of volume and stock prices within short time periods pose a great challenge to network convergence. Informer is a recent network that improves on the Transformer with smaller computational complexity, longer prediction length, and global time-stamp features. We designed three experiments to compare Informer with the commonly used networks LSTM, Transformer, and BERT at 1-minute and 5-minute frequencies for four different stocks/market indices. The prediction results are measured by three evaluation criteria: MAE, RMSE, and MAPE. Informer obtained the best performance among all the networks on every dataset. The network without the global time-stamp mechanism shows a significantly lower prediction effect than the complete Informer; evidently, this mechanism endows the time series with temporal characteristics and substantially improves the prediction accuracy of the networks. Finally, a transfer-learning experiment is conducted, in which Informer also achieves good performance. Informer shows good robustness and improved performance in market prediction and can be readily adapted to real trading.
https://arxiv.org/abs/2305.14382
Event Causality Identification (ECI) aims to identify causal relations between events in unstructured texts. This is a very challenging task, because causal relations are usually expressed by implicit associations between events. Existing methods usually capture such associations by directly modeling the texts with pre-trained language models, which underestimate two kinds of semantic structures vital to the ECI task, namely, event-centric structure and event-associated structure. The former includes important semantic elements related to the events to describe them more precisely, while the latter contains semantic paths between two events to provide possible supports for ECI. In this paper, we study the implicit associations between events by modeling the above explicit semantic structures, and propose a Semantic Structure Integration model (SemSIn). It utilizes a GNN-based event aggregator to integrate the event-centric structure information, and employs an LSTM-based path aggregator to capture the event-associated structure information between two events. Experimental results on three widely used datasets show that SemSIn achieves significant improvements over baseline methods.
https://arxiv.org/abs/2305.12792
This paper presents the Coswara dataset, a dataset containing a diverse set of respiratory sounds and rich metadata, recorded between April 2020 and February 2022 from 2635 individuals (1819 SARS-CoV-2 negative, 674 positive, and 142 recovered subjects). The respiratory sounds contain nine sound categories associated with variants of breathing, cough and speech. The rich metadata contains demographic information associated with age, gender and geographic location, as well as health information relating to symptoms, pre-existing respiratory ailments, comorbidity and SARS-CoV-2 test status. Our study is the first of its kind to annotate the audio quality of the entire dataset (amounting to 65 hours) through manual listening. The paper summarizes the data collection procedure and the demographic, symptom and audio data information. A COVID-19 classifier based on a bi-directional long short-term memory (BLSTM) architecture is trained and evaluated on the different population sub-groups contained in the dataset to understand the bias/fairness of the model. This enables an analysis of the impact of gender, geographic location, date of recording, and language proficiency on COVID-19 detection performance.
https://arxiv.org/abs/2305.12741
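A BLSTM classifier over frame-level acoustic features of the kind used in such screening experiments can be sketched as follows (the feature dimensionality, pooling, and layer sizes are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """Sketch: bi-directional LSTM over per-frame acoustic features (e.g. MFCCs),
    mean-pooled over time and mapped to a binary COVID-19 decision."""

    def __init__(self, n_feats=39, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, feats):                  # feats: (batch, frames, n_feats)
        out, _ = self.blstm(feats)
        return self.head(out.mean(dim=1))      # logits: (batch, 2)


print(BLSTMClassifier()(torch.randn(3, 200, 39)).shape)   # torch.Size([3, 2])
```

Fairness analysis then amounts to training such a model once and evaluating its scores separately on each demographic or recording sub-group of the dataset.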
Energy-based language models (ELMs) parameterize an unnormalized distribution over natural sentences and are radically different from popular autoregressive language models (ALMs). As an important application, ELMs have been successfully used as a means for calculating sentence scores in speech recognition, but they all use less-modern CNN or LSTM networks. The recent progress in Transformer networks and large pretrained models such as BERT and GPT2 opens new possibilities for further advancing ELMs. In this paper, we explore different architectures of energy functions and different training methods to investigate the capabilities of ELMs in rescoring for speech recognition, all using large pretrained models as backbones.
https://arxiv.org/abs/2305.12676
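For rescoring, an energy-based LM only needs to assign a scalar energy to each hypothesis; the n-best list is then re-ranked by combining the recognizer's score with the (negated) energy. A hedged sketch with a small Transformer encoder standing in for the large pretrained backbone (the energy head and interpolation weight are illustrative):

```python
import torch
import torch.nn as nn

class EnergyLM(nn.Module):
    """Sketch: a scalar energy head on top of a sentence encoder. The encoder
    here is a small TransformerEncoder stand-in; the paper's backbones are
    large pretrained models such as BERT or GPT2."""

    def __init__(self, vocab=1000, d_model=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.energy_head = nn.Linear(d_model, 1)

    def forward(self, token_ids):                        # (batch, seq)
        h = self.encoder(self.emb(token_ids))
        return self.energy_head(h.mean(dim=1)).squeeze(-1)   # lower = more natural


def rescore(nbest_ids, am_scores, elm, weight=0.5):
    """Re-rank an n-best list by acoustic score minus weighted energy."""
    with torch.no_grad():
        energies = elm(nbest_ids)
    total = am_scores - weight * energies
    return torch.argsort(total, descending=True)         # best hypothesis first


elm = EnergyLM()
order = rescore(torch.randint(0, 1000, (5, 12)), torch.randn(5), elm)
print(order)
```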
This paper proposes a sequence-to-sequence learning approach for Arabic pronoun resolution, which explores the effectiveness of using advanced natural language processing (NLP) techniques, specifically Bi-LSTM and the BERT pre-trained language model, in solving the pronoun resolution problem in Arabic. The proposed approach is evaluated on the AnATAr dataset, and its performance is compared to several baseline models, including traditional machine learning models and handcrafted feature-based models. Our results demonstrate that the proposed model outperforms the baseline models, which include KNN, logistic regression, and SVM, across all metrics. In addition, we explore the effectiveness of various modifications to the model, including concatenating the anaphor text beside the paragraph text as input, adding a mask to focus on candidate scores, and filtering candidates based on gender and number agreement with the anaphor. Our results show that these modifications significantly improve the model's performance, achieving up to 81% MRR and a 71% F1 score, while also demonstrating higher precision, recall, and accuracy. These findings suggest that the proposed model is an effective approach to Arabic pronoun resolution and highlight the potential benefits of leveraging advanced NLP neural models.
https://arxiv.org/abs/2305.11529
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique to enhance the generalization performance in VLN is the use of an independent speaker model to provide pseudo instructions for data augmentation. However, current speaker models based on Long-Short Term Memory (LSTM) lack the ability to attend to features relevant at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS uses a spatio-temporal encoder to fuse panoramic representations and encode intermediate connections through steps. In addition, to avoid the misalignment problem that could result in incorrect supervision, a speaker progress monitor (SPM) is proposed to enable the model to estimate the progress of instruction generation and facilitate more fine-grained caption results. Additionally, a multifeature dropout (MFD) strategy is introduced to alleviate overfitting. The proposed PASTS is flexible to be combined with existing VLN models. The experimental results demonstrate that PASTS outperforms all existing speaker models and successfully improves the performance of previous VLN models, achieving state-of-the-art performance on the standard Room-to-Room (R2R) dataset.
https://arxiv.org/abs/2305.11918
In motor neuroscience, artificial recurrent neural network models often complement animal studies. However, most modeling efforts are limited to data-fitting, and the few that examine virtual embodied agents in a reinforcement learning context do not draw direct comparisons to their biological counterparts. Our study addresses this gap by uncovering structured neural activity of a virtual robot performing legged locomotion that directly supports experimental findings of primate walking and cycling. We find that embodied agents trained to walk exhibit smooth dynamics that avoid tangling -- or opposing neural trajectories in neighboring neural space -- a core principle in computational neuroscience. Specifically, across a wide suite of gaits, the agent's neural trajectories in the recurrent layers are less tangled than those in the input-driven actuation layers. To better interpret the neural separation of these elliptical-shaped trajectories, we identify speed axes that maximize the variance of mean activity across different forward, lateral, and rotational speed conditions.
https://arxiv.org/abs/2305.11107
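Tangling is commonly quantified as Q(t) = max over t' of ||x_dot(t) - x_dot(t')||^2 / (||x(t) - x(t')||^2 + eps), i.e., how strongly nearby states move in conflicting directions. A small numpy sketch of that standard metric (following the common definition from the primate motor-cortex literature, not necessarily the paper's exact code):

```python
import numpy as np

def tangling(x, dt=0.01, eps=1e-6):
    """Trajectory tangling Q(t) = max_t' ||x_dot(t) - x_dot(t')||^2 /
    (||x(t) - x(t')||^2 + eps); high values mean that nearby states move in
    opposing directions. x has shape (timesteps, neurons)."""
    x_dot = np.gradient(x, dt, axis=0)
    q = np.zeros(len(x))
    for t in range(len(x)):
        num = np.sum((x_dot[t] - x_dot) ** 2, axis=1)
        den = np.sum((x[t] - x) ** 2, axis=1) + eps
        q[t] = np.max(num / den)
    return q


# A smooth circular trajectory (like steady cycling or walking) stays untangled;
# a path that crosses itself with opposite velocities would score much higher.
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(tangling(circle).max())
```

Comparing this quantity between the recurrent layers and the input-driven actuation layers is what supports the "less tangled" claim above.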
Language is by its very nature incremental in how it is produced and processed. This property can be exploited by NLP systems to produce fast responses, which has been shown to be beneficial for real-time interactive applications. Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers. RNNs are fast but monotonic (cannot correct earlier output, which can be necessary in incremental processing). Transformers, on the other hand, consume whole sequences, and hence are by nature non-incremental. A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise. However, this method becomes costly as the sentence grows longer. In this work, we propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy. Experimental results on sequence labelling show that our model has better incremental performance and faster inference speed compared to restart-incremental Transformers, while showing little degradation on full sequences.
https://arxiv.org/abs/2305.10845
This paper proposes a novel Attention-based Encoder-Decoder network for End-to-End Neural speaker Diarization (AED-EEND). In the AED-EEND system, we incorporate the target speaker enrollment information used in target-speaker voice activity detection (TS-VAD) to calculate the attractor, which can mitigate the speaker permutation problem and facilitate easier model convergence. In the training process, we propose a teacher-forcing strategy to obtain the enrollment information using the ground-truth labels. Furthermore, we propose three heuristic decoding methods to identify the enrollment area for each speaker during the evaluation process. Additionally, we enhance the LSTM-based attractor calculation network used in the end-to-end encoder-decoder based attractor calculation (EEND-EDA) system by incorporating an attention-based model. By utilizing such an attention-based attractor decoder, our proposed AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5 s of enrollment data.
https://arxiv.org/abs/2305.10704
Works in quantum machine learning (QML) over the past few years indicate that QML algorithms can function just as well as their classical counterparts, and even outperform them in some cases. Among the corpus of recent work, many current QML models take advantage of variational quantum algorithm (VQA) circuits, given that their scale is typically small enough to be compatible with NISQ devices and the method of automatic differentiation for optimizing circuit parameters is familiar to machine learning (ML). While the results bear interesting promise for an era when quantum machines are more readily accessible, if one can achieve similar results through non-quantum methods then there may be a more near-term advantage available to practitioners. To this end, the nature of this work is to investigate the utilization of stochastic methods inspired by a variational quantum version of the long short-term memory (LSTM) model in an attempt to approach the reported successes in performance and rapid convergence. By analyzing the performance of classical, stochastic, and quantum methods, this work aims to elucidate if it is possible to achieve some of QML's major reported benefits on classical machines by incorporating aspects of its stochasticity.
https://arxiv.org/abs/2305.10212
Stagnant weather conditions are among the major contributors to air pollution, as they are favorable for the formation and accumulation of pollutants. To measure the atmosphere's ability to dilute air pollutants, the Air Stagnation Index (ASI) has been introduced as an important meteorological index. Making long-lead ASI forecasts is therefore vital for planning air quality management in advance. In this study, we found that autumn Niño indices derived from sea surface temperature (SST) anomalies show a negative correlation with wintertime ASI in southern China, offering prospects for a prewinter forecast. We developed an LSTM-based model to predict the future wintertime ASI. Results demonstrated that multivariate inputs (past ASI and Niño indices) achieve better forecast performance than a univariate input (only past ASI). The model achieves a correlation coefficient of 0.778 between the actual and predicted ASI, exhibiting a high degree of consistency.
https://arxiv.org/abs/2305.11901
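The forecasting setup described above is a small multivariate sequence model: a window of past ASI values and autumn Niño indices goes in, and a wintertime ASI prediction comes out. A hedged sketch (window length, sizes, and the variable layout are assumptions):

```python
import torch
import torch.nn as nn

class ASIForecaster(nn.Module):
    """Sketch: LSTM over a window of past (ASI, Nino-index) pairs that predicts
    the next wintertime ASI value."""

    def __init__(self, n_inputs=2, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, years, n_inputs)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)   # predicted ASI: (batch,)


model = ASIForecaster()
past = torch.randn(8, 5, 2)                   # 5 past years of (ASI, Nino index)
loss = nn.functional.mse_loss(model(past), torch.randn(8))
loss.backward()
print(loss.item())
```

The multivariate-versus-univariate comparison in the abstract amounts to setting n_inputs to 2 (ASI plus Niño index) versus 1 (ASI only).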