This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech-related tasks, such as Automatic Speech Recognition (ASR). The two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on A.A. Zaliznyak's "Grammatical Dictionary of the Russian Language" and the Wiktionary corpus. To distinguish homographs, the accentuation system also uses morphological information about the sentences, obtained with Recurrent Neural Networks (RNNs). The transcription algorithms apply the rules presented in B.M. Lobanov and L.I. Tsirulnik's monograph "Computer Synthesis and Voice Cloning". The rules described in the present paper are implemented in an open-source module, which can be of use in any scientific study connected to ASR or Speech-to-Text (STT) tasks. Automatically marked-up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated with cross-validation, yielding a mean word accuracy of 71.2%. The developed toolkit is written in Python and is available on GitHub for any interested researcher.
https://arxiv.org/abs/2410.02538
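The accentuation step is, at its core, a dictionary lookup keyed by word form plus a morphological tag for homographs. A minimal sketch of that idea in Python, with illustrative entries rather than the module's actual data or format:

```python
# Stress positions (0-based vowel index) keyed by (word form, POS tag);
# homographs such as доро́га "road" vs. дорога́ "(is) dear" are resolved
# by the morphological tag supplied by an RNN tagger.
ACCENT_DICT = {
    ("дорога", "NOUN"): 1,  # доро́га "road"
    ("дорога", "ADJ"): 2,   # дорога́ "(is) dear" (short adjective)
    ("молоко", None): 2,    # молоко́: unambiguous, no tag needed
}

VOWELS = "аеёиоуыэюя"

def accentuate(word: str, pos: str | None = None) -> str:
    """Insert '+' before the stressed vowel, a common accent-markup
    convention in Russian ASR/TTS lexicons."""
    idx = ACCENT_DICT.get((word, pos))
    if idx is None:
        idx = ACCENT_DICT.get((word, None))
    if idx is None:
        return word  # out-of-vocabulary: left unaccented
    out, seen = [], -1
    for ch in word:
        if ch.lower() in VOWELS:
            seen += 1
            if seen == idx:
                out.append("+")
        out.append(ch)
    return "".join(out)

print(accentuate("дорога", "NOUN"))  # дор+ога
print(accentuate("дорога", "ADJ"))   # дорог+а
print(accentuate("молоко"))          # молок+о
```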
Establishing and maintaining 5G mmWave vehicular connectivity poses a significant challenge due to high user mobility, which necessitates frequent triggering of beam switching procedures. Departing from reactive beam switching based on user device channel state feedback, proactive beam switching prepares upcoming beam switching decisions in advance by exploiting accurate channel state information (CSI) prediction. In this paper, we develop a framework for autonomous, self-trained CSI prediction for mmWave vehicular users in which a base station (gNB) collects and labels a dataset used to train a recurrent neural network (RNN)-based CSI prediction model. The proposed framework exploits the CSI feedback from vehicular users combined with overhearing the C-V2X cooperative awareness messages (CAMs) they broadcast. We implement and evaluate the proposed framework using the DeepMIMO dataset generation environment and demonstrate its capability to provide accurate CSI prediction for 5G mmWave vehicular users. We also investigate the trained model's ability to provide accurate CSI predictions from various input features.
https://arxiv.org/abs/2410.02326
Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or movement of nonrigid objects can drastically alter available image features. How do biological visual systems track objects as they change? It may involve specific attentional mechanisms for reasoning about the locations of objects independently of their appearances -- a capability that prominent neuroscientific theories have associated with computing through neural synchrony. We computationally test the hypothesis that the implementation of visual attention through neural synchrony underlies the ability of biological visual systems to track objects that change in appearance over time. We first introduce a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks observers to track objects as their locations and appearances change in precisely controlled ways. While humans effortlessly solved FeatureTracker, state-of-the-art DNNs did not. In contrast, our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization as a neural substrate for tracking appearance-morphing objects as they move about.
https://arxiv.org/abs/2410.02094
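For readers curious what "computing with synchrony" can look like in code, here is a minimal complex-valued recurrent cell with a modReLU-style activation. This is a generic sketch, not the paper's exact CV-RNN: the nonlinearity acts on magnitudes, while phases, which can carry binding information, pass through untouched.

```python
import torch
import torch.nn as nn

class ComplexRNNCell(nn.Module):
    """Minimal complex-valued recurrent cell (modReLU activation)."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_in = nn.Parameter(
            torch.randn(input_size, hidden_size, dtype=torch.cfloat) * 0.1)
        self.W_rec = nn.Parameter(
            torch.randn(hidden_size, hidden_size, dtype=torch.cfloat) * 0.1)
        self.bias = nn.Parameter(torch.zeros(hidden_size))  # real-valued

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z = x @ self.W_in + h @ self.W_rec
        # modReLU: nonlinearity on the magnitude only; the phase channel,
        # which can bind features to objects, is left untouched.
        mag = torch.abs(z)
        phase = z / (mag + 1e-8)
        return torch.relu(mag + self.bias).to(torch.cfloat) * phase

cell = ComplexRNNCell(4, 8)
x = torch.randn(2, 4, dtype=torch.cfloat)
h = torch.zeros(2, 8, dtype=torch.cfloat)
h = cell(x, h)
print(h.shape, h.dtype)  # torch.Size([2, 8]) torch.complex64
```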
In the fields of computational mathematics and artificial intelligence, the need for precise data modeling is crucial, especially for predictive machine learning tasks. This paper further explores XNet, a novel algorithm that employs the complex-valued Cauchy integral formula, offering a superior network architecture that surpasses traditional Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold Networks (KANs). XNet significantly improves speed and accuracy across various tasks in both low- and high-dimensional spaces, redefining the scope of data-driven model development and providing substantial improvements over established time series models like LSTMs.
https://arxiv.org/abs/2410.02033
In recent years, deep learning has revolutionized the field of protein science, enabling advancements in predicting protein properties, structural folding, and interactions. This paper presents DeepProtein, a comprehensive and user-friendly deep learning library specifically designed for protein-related tasks. DeepProtein integrates several state-of-the-art neural network architectures, including the convolutional neural network (CNN), recurrent neural network (RNN), transformer, graph neural network (GNN), and graph transformer (GT). It provides user-friendly interfaces, facilitating domain researchers in applying deep learning techniques to protein data. We also curate a benchmark that evaluates these neural architectures on a variety of protein tasks, including protein function prediction, protein localization prediction, and protein-protein interaction prediction, showcasing the library's superior performance and scalability. Additionally, we provide detailed documentation and tutorials to promote accessibility and encourage reproducible research. The library extends the well-known drug discovery library DeepPurpose and is publicly available at this https URL.
https://arxiv.org/abs/2410.02023
The scalability limitations of Transformers with respect to sequence length have renewed interest in recurrent sequence models that are parallelizable during training. As a result, many novel recurrent architectures, such as S4, Mamba, and Aaren, have been proposed that achieve comparable performance. In this work, we revisit traditional recurrent neural networks (RNNs) from over a decade ago: LSTMs (1997) and GRUs (2014). While these models were slow because they require backpropagation through time (BPTT), we show that by removing the hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need BPTT and can be efficiently trained in parallel. Building on this, we introduce minimal versions (minLSTMs and minGRUs) that (1) use significantly fewer parameters than their traditional counterparts and (2) are fully parallelizable during training (175x faster for a sequence of length 512). Lastly, we show that these stripped-down versions of decade-old RNNs match the empirical performance of recent sequence models.
https://arxiv.org/abs/2410.01201
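The key observation is concrete enough to sketch: once the gate and candidate depend only on the current input, the recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h~_t is affine in h and admits a parallel scan. A minGRU sketch in PyTorch; the explicit loop below stands in for the scan used for fast training:

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """minGRU: gate and candidate depend only on the input, never on
    h_{t-1}, so the update h_t = (1 - z_t) * h_{t-1} + z_t * h~_t is an
    affine recurrence that can be evaluated with a parallel scan."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.linear_z = nn.Linear(input_size, hidden_size)
        self.linear_h = nn.Linear(input_size, hidden_size)

    def forward(self, x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_size); h0: (batch, hidden_size)
        z = torch.sigmoid(self.linear_z(x))  # gates from input only
        h_tilde = self.linear_h(x)           # candidates: no hidden-state link
        h, outs = h0, []
        for t in range(x.size(1)):           # sequential form, for clarity only
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

x = torch.randn(2, 512, 32)
out = MinGRU(32, 64)(x, torch.zeros(2, 64))
print(out.shape)  # torch.Size([2, 512, 64])
```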
Uncertainty in the environment has long been a difficult characteristic to handle when performing real-world robot tasks, because it produces unexpected observations that cannot be covered by manual scripting. Learning-based robot control methods are a promising approach for generating flexible motions in unknown situations, but they still tend to suffer under uncertainty due to their deterministic nature. In order to adaptively perform the target task under such conditions, the robot control model must be able to accurately understand the possible uncertainty and to exploratively derive the optimal action that minimizes it. This paper extends an existing predictive-learning-based robot control method that employs foresight prediction using dynamic internal simulation. The foresight module refines the model's hidden states by sampling multiple possible futures and replacing the current state with the one that led to the lower future uncertainty. The adaptiveness of the model was evaluated on a door-opening task. The door can be opened by pushing, pulling, or sliding, but the robot cannot visually distinguish which, and is required to adapt on the fly. The results showed that the proposed model adaptively diverged its motion through interaction with the door, whereas conventional methods failed to diverge stably. The models were analyzed using the Lyapunov exponents of the RNN hidden states, which reflect the possible divergence at each time step during task execution. The results indicated that the foresight module biased the model to consider future consequences, leading to uncertainty being embedded in the robot controller's policy rather than in the resultant observations. This is beneficial for implementing adaptive behaviors, as it induces the derivation of diverse motions during exploration.
https://arxiv.org/abs/2410.00774
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker's appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
https://arxiv.org/abs/2409.20301
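The alignment-free label construction is simple enough to show directly. A sketch with illustrative token names; the point is that only the order of speaker appearance in the mixture is needed, not word-level timestamps from an external ASR system.

```python
def make_aft_targets(transcripts_by_appearance: list[str]) -> list[str]:
    """Prefix each speaker's transcription with a prompt token (<spk1>,
    <spk2>, ...) reflecting the order in which the speakers start talking
    in the mixture. Token names are illustrative."""
    return [
        f"<spk{i}> {text}"
        for i, text in enumerate(transcripts_by_appearance, start=1)
    ]

# Speaker order in the mixture determines the prompt token assignment:
print(make_aft_targets(["hello there", "how are you"]))
# ['<spk1> hello there', '<spk2> how are you']
```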
The advancement of the Natural Language Processing field has enabled the development of language models with a great capacity for generating text. In recent years, Neuroscience has been using these models to better understand cognitive processes. In previous studies, we found that models like N-grams and LSTM networks can partially model Predictability when used as a co-variable to explain readers' eye movements. In the present work, we further this line of research by using GPT-2-based models. The results show that this architecture achieves better outcomes than its predecessors.
https://arxiv.org/abs/2409.20174
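As context, predictability in such studies is typically operationalized as per-word surprisal, -log p(word | context). A minimal sketch of extracting it from GPT-2 with the Hugging Face transformers library; this is illustrative, not the authors' exact pipeline:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

sentence = "The cat sat on the mat"
ids = tok(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits, dim=-1)

# The logit at position t predicts token t+1, so surprisal of token t+1
# is the negative log-probability the model assigned to it at step t.
for t in range(ids.size(1) - 1):
    surprisal = -log_probs[0, t, ids[0, t + 1]].item()
    print(tok.decode(ids[0, t + 1]), round(surprisal, 2))
```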
In this paper we propose a novel approach based on knowledge graphs to provide timely access to structured information, to enable actionable technology intelligence, and to improve cyber-physical systems planning. Our framework encompasses a text mining process, which includes information retrieval, keyphrase extraction, semantic network creation, and topic map visualization. Following this data exploration process, we employ a selective knowledge graph construction (KGC) approach supported by an electronics and innovation ontology-backed pipeline for multi-objective decision-making, with a focus on cyber-physical systems. We apply our methodology to the domain of automotive electrical systems to demonstrate the approach, which is scalable. Our results demonstrate that, on a pre-defined dataset, our construction process outperforms GraphGPT as well as our bi-LSTM and transformer REBEL baselines by several times in terms of class recognition, relationship construction, and correct "subclass of" categorization. Additionally, we outline reasoning applications and provide a comparison with Wikidata to show the differences and advantages of the approach.
https://arxiv.org/abs/2409.20010
Automatic music transcription (AMT), which aims to convert musical signals into musical notation, is one of the important tasks in music information retrieval. Recently, previous works have applied high-resolution labels, i.e., the continuous onset and offset times of piano notes, as training targets, achieving substantial improvements in transcription performance. However, some issues remain to be addressed: for example, the harmonics of notes are sometimes recognized as false-positive notes, and AMT models tend to grow larger in order to improve transcription performance. To address these issues, we propose an improved high-resolution piano transcription model that better captures the specific acoustic characteristics of music signals. First, we employ the Constant-Q Transform as the input representation to better adapt to musical signals. Moreover, we design two architectures: the first is based on a convolutional recurrent neural network (CRNN) with dilated convolution, and the second is an encoder-decoder architecture that combines a CRNN with a non-autoregressive Transformer decoder. We conduct systematic experiments with our models. Compared to the high-resolution AMT system used as a baseline, our models achieve (1) consistent improvement in note-level metrics and (2) significantly smaller model sizes, which sheds light on future work.
https://arxiv.org/abs/2409.19614
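The CQT front end can be sketched with librosa; its log-spaced frequency bins align with musical semitones, which is the motivation for preferring it over an STFT for piano signals. The parameter values below are common AMT choices, not necessarily the paper's:

```python
import librosa
import numpy as np

sr = 22050
y = librosa.tone(440.0, sr=sr, duration=2.0)  # stand-in for a recording

cqt = librosa.cqt(
    y, sr=sr,
    hop_length=512,
    fmin=librosa.note_to_hz("A0"),  # lowest piano key (27.5 Hz)
    n_bins=88 * 4,                  # 88 keys, 4 bins per semitone
    bins_per_octave=48,
)
log_cqt = librosa.amplitude_to_db(np.abs(cqt))  # model input: (freq_bins, frames)
print(log_cqt.shape)
```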
Depth perception is essential for a robot's spatial and geometric understanding of its environment, with many tasks traditionally relying on hardware-based depth sensors like RGB-D or stereo cameras. However, these sensors face practical limitations, including issues with transparent and reflective objects, high costs, calibration complexity, spatial and energy constraints, and increased failure rates in compound systems. While monocular depth estimation methods offer a cost-effective and simpler alternative, their adoption in robotics is limited because they output relative rather than metric depth, the latter being crucial for robotics applications. In this paper, we propose a method that utilizes a single calibrated camera, enabling the robot to act as a "measuring stick" to convert relative depth estimates into metric depth in real time as tasks are performed. Our approach employs an LSTM-based metric depth regressor, trained online and refined through probabilistic filtering, to accurately restore the metric depth across the monocular depth map, particularly in areas proximal to the robot's motion. Experiments with real robots demonstrate that our method significantly outperforms current state-of-the-art monocular metric depth estimation techniques, achieving a 22.1% reduction in depth error and a 52% increase in success rate for a downstream task.
https://arxiv.org/abs/2409.19490
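The underlying problem is recovering metric depth d ≈ s·r + t from a relative map r. As a hedged illustration of that core step only, here is a closed-form least-squares fit against sparse metric anchors; the paper instead learns this mapping online with an LSTM regressor refined by probabilistic filtering:

```python
import numpy as np

def fit_scale_shift(relative_depth, metric_anchors, anchor_mask):
    """Recover metric depth d = s*r + t from a relative depth map r using
    sparse points whose true depth is known (e.g., from the robot's own
    calibrated motion). Closed-form stand-in for the paper's regressor."""
    r = relative_depth[anchor_mask]
    d = metric_anchors[anchor_mask]
    A = np.stack([r, np.ones_like(r)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * relative_depth + t

rel = np.random.rand(480, 640)       # relative depth from a monocular model
true = 3.0 * rel + 0.5               # toy ground truth
mask = np.zeros_like(rel, dtype=bool)
mask[::60, ::80] = True              # sparse metric anchors
metric = fit_scale_shift(rel, true, mask)
print(np.abs(metric - true).max())   # ~0 on this toy example
```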
Activity recognition is a challenging task due to the large scale of trajectory data and the need for prompt and efficient processing. Existing methods have attempted to mitigate this problem by employing traditional LSTM architectures, but these approaches often suffer from inefficiencies in processing large datasets. In response to this challenge, we propose VecLSTM, a novel framework that enhances the performance and efficiency of LSTM-based neural networks. Unlike conventional approaches, VecLSTM incorporates vectorization layers, leveraging optimized mathematical operations to process input sequences more efficiently. We have implemented VecLSTM and incorporated it into the MySQL database. To evaluate the effectiveness of VecLSTM, we compare its performance against a conventional LSTM model using a dataset comprising 1,467,652 samples with seven unique labels. Experimental results demonstrate superior accuracy and efficiency compared to the state-of-the-art, with VecLSTM achieving a validation accuracy of 85.57%, a test accuracy of 85.47%, and a weighted F1-score of 0.86. Furthermore, VecLSTM significantly reduces training time, offering a 26.2% reduction compared to traditional LSTM models.
https://arxiv.org/abs/2409.19258
Meta learning has been widely used to exploit rich-resource source tasks to improve the performance of low-resource target tasks. Unfortunately, most existing meta learning approaches treat different source tasks equally, ignoring the relatedness of source tasks to the target task in knowledge transfer. To mitigate this issue, we propose a reinforcement-based multi-source meta-transfer learning framework (Meta-RTL) for low-resource commonsense reasoning. In this framework, we present a reinforcement-based approach to dynamically estimating source task weights, which measure the contribution of the corresponding tasks to the target task in meta-transfer learning. The differences between the general loss of the meta model and the task-specific losses of source-specific temporal meta models on sampled target data are fed into the policy network of the reinforcement learning module as rewards. The policy network is built upon LSTMs that capture long-term dependencies in source task weight estimation across meta-learning iterations. We evaluate the proposed Meta-RTL using both BERT and ALBERT as the backbone of the meta model on three commonsense reasoning benchmark datasets. Experimental results demonstrate that Meta-RTL substantially outperforms strong baselines and previous task selection strategies, and achieves larger improvements in extremely low-resource settings.
https://arxiv.org/abs/2409.19075
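A hedged sketch of the reinforcement module described here: the reward is the gap between the meta model's general loss and each source-specific temporal meta model's loss on sampled target data, and an LSTM policy network turns the reward history into source-task weights. Shapes and names are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SourceWeightPolicy(nn.Module):
    """LSTM policy tracking source-task rewards across meta-learning
    iterations and emitting a softmax weighting over source tasks."""

    def __init__(self, n_sources: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_sources, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_sources)

    def forward(self, reward_history: torch.Tensor) -> torch.Tensor:
        # reward_history: (1, steps, n_sources) of past rewards
        out, _ = self.lstm(reward_history)
        return torch.softmax(self.head(out[:, -1]), dim=-1)

policy = SourceWeightPolicy(n_sources=3)
general_loss = torch.tensor([0.9])                 # meta model on target data
task_losses = torch.tensor([0.7, 1.1, 0.8])        # source-specific meta models
rewards = (general_loss - task_losses).view(1, 1, -1)  # positive = source helps
print(policy(rewards))  # updated source-task weights, summing to 1
```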
Stability in recurrent neural models poses a significant challenge, particularly in developing biologically plausible neurodynamical models that can be seamlessly trained. Traditional cortical circuit models are notoriously difficult to train due to expansive nonlinearities in the dynamical system, leading to an optimization problem with nonlinear stability constraints that are difficult to impose. Conversely, recurrent neural networks (RNNs) excel in tasks involving sequential data but lack biological plausibility and interpretability. In this work, we address these challenges by linking dynamic divisive normalization (DN) to the stability of ORGaNICs, a biologically plausible recurrent cortical circuit model that dynamically achieves DN and has been shown to simulate a wide range of neurophysiological phenomena. By using the indirect method of Lyapunov, we prove the remarkable property of unconditional local stability for an arbitrary-dimensional ORGaNICs circuit when the recurrent weight matrix is the identity. We thus connect ORGaNICs to a system of coupled damped harmonic oscillators, which enables us to derive the circuit's energy function, providing a normative principle of what the circuit, and individual neurons, aim to accomplish. Further, for a generic recurrent weight matrix, we prove the stability of the 2D model and demonstrate empirically that stability holds in higher dimensions. Finally, we show that ORGaNICs can be trained by backpropagation through time without gradient clipping/scaling, thanks to its intrinsic stability property and adaptive time constants, which address the problems of exploding, vanishing, and oscillating gradients. By evaluating the model's performance on RNN benchmarks, we find that ORGaNICs outperform alternative neurodynamical models on static image classification tasks and perform comparably to LSTMs on sequential tasks.
https://arxiv.org/abs/2409.18946
One of the most important challenges in the financial and cryptocurrency field is accurately predicting cryptocurrency price trends. Leveraging artificial intelligence (AI) is beneficial in addressing this challenge. Cryptocurrency markets, marked by substantial growth and volatility, attract investors and scholars keen on deciphering and forecasting cryptocurrency price movements. The vast and diverse array of data available for such predictions increases the complexity of the task. In our study, we introduce a novel approach termed hard and soft information fusion (HSIF) to enhance the accuracy of cryptocurrency price movement forecasts. The hard information component of our approach encompasses historical price records alongside technical indicators. Complementing this, the soft information component is extracted from X (formerly Twitter), encompassing news headlines and tweets about the cryptocurrency. To use this data, we employ a Bidirectional Encoder Representations from Transformers (BERT)-based sentiment analysis method, financial BERT (FinBERT), which performs best. Finally, our model is fed the combined set of processed hard and soft data. We employ the bidirectional long short-term memory (BiLSTM) model because processing information in both forward and backward directions can capture long-term dependencies in sequential information. Our empirical findings, based on Bitcoin-related data, emphasize the superiority of the HSIF approach over models dependent on single-source data. By fusing hard and soft information on the Bitcoin dataset, our model achieves about 96.8% accuracy in predicting price movement. Incorporating soft information enables our model to grasp the influence of social sentiment on price fluctuations, thereby supplementing the technical analysis-based predictions derived from hard information.
https://arxiv.org/abs/2409.18895
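A minimal sketch of the fusion architecture, assuming daily FinBERT sentiment scores as the soft features and price/technical indicators as the hard features; the dimensions are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class HSIFBiLSTM(nn.Module):
    """Feature-level fusion of hard (price/technical) and soft (sentiment)
    inputs, followed by a BiLSTM that classifies the next price movement."""

    def __init__(self, n_hard: int = 8, n_soft: int = 3, hidden: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(n_hard + n_soft, hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)   # up / down

    def forward(self, hard: torch.Tensor, soft: torch.Tensor) -> torch.Tensor:
        x = torch.cat([hard, soft], dim=-1)    # fuse at the feature level
        out, _ = self.bilstm(x)
        return self.head(out[:, -1])           # logits for next movement

model = HSIFBiLSTM()
hard = torch.randn(4, 30, 8)    # 30 days of price records + indicators
soft = torch.randn(4, 30, 3)    # daily FinBERT pos/neu/neg scores
print(model(hard, soft).shape)  # torch.Size([4, 2])
```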
Integrating machine learning (ML) into customer service chatbots enhances their ability to understand and respond to user queries, ultimately improving service performance. However, they may appear artificial to some users, affecting the customer experience. Hence, meticulous evaluation of the ML models for each pipeline component is crucial for optimizing performance, though differences in functionality can lead to unfair comparisons. In this paper, we present a tailored experimental evaluation approach for goal-oriented customer service chatbots with a pipeline architecture, focusing on three key components: Natural Language Understanding (NLU), dialogue management (DM), and Natural Language Generation (NLG). Our methodology emphasizes individual assessment to determine optimal ML models. Specifically, we focus on optimizing hyperparameters and evaluating candidate models for NLU (utilizing BERT and LSTM), DM (employing DQN and DDQN), and NLG (leveraging GPT-2 and DialoGPT). The results show that for the NLU component, BERT excelled in intent detection whereas LSTM was superior for slot filling. For the DM component, the DDQN model outperformed DQN by achieving fewer turns, higher rewards, and greater success rates. For NLG, the large language model GPT-2 surpassed DialoGPT in BLEU, METEOR, and ROUGE metrics. These findings aim to provide a benchmark for future research in developing and optimizing customer service chatbots, offering valuable insights into model performance and optimal hyperparameters.
https://arxiv.org/abs/2409.18568
Low latency models are critical for real-time speech enhancement applications, such as hearing aids and hearables. However, the sub-millisecond latency space for resource-constrained hearables remains underexplored. We demonstrate speech enhancement using a computationally efficient minimum-phase FIR filter, enabling sample-by-sample processing to achieve a mean algorithmic latency of 0.32 ms to 1.25 ms. With a single microphone, we observe a mean SI-SDRi of 4.1 dB. The approach generalizes, with a DNSMOS increase of 0.2 on unseen audio recordings. We use a lightweight LSTM-based model of 644k parameters to generate the FIR taps. We benchmark our system on a low-power DSP, where it runs at 388 MIPS with a mean end-to-end latency of 3.35 ms, and provide a comparison with baseline low-latency spectral masking techniques. We hope this work enables a better understanding of latency and can be used to improve the comfort and usability of hearables.
https://arxiv.org/abs/2409.18239
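For intuition on why minimum phase matters here: it gives the lowest group delay achievable for a given magnitude response. Below is a hedged numpy sketch of deriving minimum-phase FIR taps from a desired magnitude response via the standard real-cepstrum (homomorphic) method; in the paper an LSTM produces the taps, so this is illustrative only:

```python
import numpy as np

def minimum_phase_fir(mag_response: np.ndarray, n_taps: int) -> np.ndarray:
    """Build a minimum-phase FIR filter from a desired magnitude response
    (length n_fft//2 + 1, for a real signal) via the real cepstrum."""
    n_fft = 2 * (len(mag_response) - 1)
    log_mag = np.log(np.maximum(mag_response, 1e-8))
    cep = np.fft.irfft(log_mag, n_fft)          # real cepstrum
    # Fold the anti-causal cepstrum onto the causal side -> minimum phase.
    win = np.zeros(n_fft)
    win[0] = 1.0
    win[1:n_fft // 2] = 2.0
    win[n_fft // 2] = 1.0
    min_phase_spec = np.exp(np.fft.rfft(cep * win, n_fft))
    return np.fft.irfft(min_phase_spec, n_fft)[:n_taps]

# Toy gain curve (a gentle high-frequency cut) -> 32 minimum-phase taps
mag = np.linspace(1.0, 0.3, 257)
taps = minimum_phase_fir(mag, 32)
print(taps[:5])
```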
The increasing number of unmanned aerial vehicles (UAVs) in urban environments requires a strategy to minimize their environmental impact, both in terms of energy efficiency and noise reduction. To address these concerns, novel strategies for developing prediction models and optimizing flight planning, for instance through deep reinforcement learning (DRL), are needed. Our goal is to develop DRL algorithms capable of enabling the autonomous navigation of UAVs in urban environments, taking into account the presence of buildings and other UAVs, and optimizing the trajectories to reduce both energy consumption and noise. This is achieved using fluid-flow simulations that represent the environment in which the UAVs navigate, and by training the UAV as an agent interacting with an urban environment. In this work, we consider a domain represented by a two-dimensional flow field with obstacles, ideally representing buildings, extracted from a three-dimensional high-fidelity numerical simulation. The presented methodology, using PPO+LSTM cells, was validated by reproducing a simple but fundamental problem in navigation, namely Zermelo's problem, which deals with a vessel navigating in a turbulent flow, travelling from a starting point to a target location while optimizing the trajectory. The current method shows a significant improvement over both a simple PPO and a TD3 algorithm, with a success rate (SR) of the PPO+LSTM trained policy of 98.7% and a crash rate (CR) of 0.1%, outperforming both PPO (SR = 75.6%, CR = 18.6%) and TD3 (SR = 77.4%, CR = 14.5%). This is a first step towards DRL strategies that will guide UAVs in a three-dimensional flow field using real-time signals, making the navigation efficient in terms of flight time and avoiding damage to the vehicle.
https://arxiv.org/abs/2409.17922
Contextual-LAS (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without an explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to make better use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias-attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a Conformer rather than an LSTM. Moreover, we directly use the bias-attention score to correct the output probability distribution of the model. Experiments are conducted using the public AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative increase in recall and a 53.49% relative increase in F1-score in the named-entity recognition scene.
https://arxiv.org/abs/2409.17603
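The last step, using the bias-attention score to correct the output distribution, can be sketched as follows; the additive form and `alpha` are assumptions for illustration, not the paper's exact correction rule:

```python
import torch

def bias_corrected_distribution(logits, bias_scores, bias_token_ids, alpha=0.5):
    """Boost the output probability of context-phrase tokens in proportion
    to their bias-attention scores, then renormalize."""
    probs = torch.softmax(logits, dim=-1)        # (vocab,)
    boost = torch.zeros_like(probs)
    for score, token_ids in zip(bias_scores, bias_token_ids):
        boost[token_ids] += score                # spread phrase score to its tokens
    probs = probs + alpha * boost
    return probs / probs.sum()                   # renormalize to a distribution

logits = torch.randn(10)                 # toy 10-token vocabulary
bias_scores = torch.tensor([0.8, 0.2])   # bias attention over 2 context phrases
bias_tokens = [torch.tensor([3, 4]), torch.tensor([7])]
print(bias_corrected_distribution(logits, bias_scores, bias_tokens))
```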