Abstract
Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has improved significantly. However, designing an optimal DL architecture requires specialised knowledge and experimental assessment. Fortunately, Neural Architecture Search (NAS) offers a way to determine the best DL model automatically, and Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports coupling a CNN with an LSTM to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique introduces a novel mechanism for selecting CNN and SeqNN operations jointly using DARTS. Unlike earlier work, we do not impose constraints on the layer order of the CNN; instead, we let DARTS choose the best layer order within the DARTS cell. Evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets, we demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM.
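To make the DARTS mechanism mentioned above concrete, the sketch below illustrates the core idea of DARTS's continuous relaxation: each edge in a search cell computes a softmax-weighted sum of all candidate operations, so the architecture choice becomes differentiable, and the final discrete architecture keeps the highest-weighted operation. This is a minimal, self-contained illustration, not the emoDARTS implementation; the candidate operations here are toy scalar functions standing in for the CNN/SeqNN layers the paper searches over, and all names (`CANDIDATE_OPS`, `mixed_op`, `derive_op`) are hypothetical.

```python
import math

# Hypothetical candidate operations on one DARTS edge. In emoDARTS these
# would be CNN or sequential (LSTM/RNN) layers; toy scalar maps suffice
# to show the relaxation mechanism.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double":   lambda x: 2.0 * x,
    "negate":   lambda x: -x,
}

def softmax(alphas):
    """Numerically stable softmax over the architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, alphas):
    """DARTS continuous relaxation: the edge outputs a softmax-weighted
    sum of ALL candidate operations, so the alphas can be learned by
    gradient descent alongside the network weights."""
    probs = softmax(alphas)
    return sum(p * op(x) for p, op in zip(probs, CANDIDATE_OPS.values()))

def derive_op(alphas):
    """After the search, discretise: keep the operation with the largest
    architecture weight on this edge."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]

# With equal alphas the mixture averages the ops: (x + 2x - x) / 3.
print(mixed_op(3.0, [0.0, 0.0, 0.0]))  # 2.0
print(derive_op([0.1, 2.5, -1.0]))     # double
```

In the full method, the alphas for every edge are optimised jointly, which is what lets emoDARTS select CNN and SeqNN operations, and their ordering inside the cell, in a single differentiable search rather than choosing them independently.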
URL
https://arxiv.org/abs/2403.14083