Abstract
Voice-based applications dominate the era of automation because speech carries many cues that convey information about the speaker as well as the spoken content. Modern Automatic Speech Recognition (ASR) is a major advance in Human-Computer Interaction (HCI), enabling efficient communication between humans and devices through Artificial Intelligence. Speech is one of the easiest mediums of communication, and it carries many features that distinguish one speaker from another. It is now possible to determine a speaker's identity from their speech through speaker recognition. In this paper, we present a method that determines a speaker's geographical identity within a given region from continuous Bengali speech. We take the eight divisions of Bangladesh as the geographical regions. We feed Mel Frequency Cepstral Coefficient (MFCC) and Delta features to an Artificial Neural Network to classify the speaker's division. Before feature extraction, we preprocess the raw audio with noise reduction and segmentation into 8-10 second clips. We use our own dataset of more than 45 hours of audio from 633 individual male and female speakers. The highest accuracy we recorded is 85.44%.
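The segmentation and Delta steps mentioned in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not give the sample rate, the exact segment length within the 8-10 second range, or the Delta formula, so everything here is an assumption: the function names `segment_audio` and `delta` are hypothetical, the default of 8 seconds is arbitrary, and the Delta uses the common regression formula with a window of 2 frames. MFCC extraction itself is omitted, since it would normally come from a signal-processing library.

```python
from typing import List, Sequence


def segment_audio(samples: Sequence[float], sample_rate: int,
                  segment_seconds: float = 8.0) -> List[Sequence[float]]:
    """Split a waveform into fixed-length segments.

    Assumption: a trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(segment_seconds * sample_rate)
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]


def delta(frames: List[List[float]], width: int = 2) -> List[List[float]]:
    """Regression-based delta of per-frame features (e.g. MFCC vectors).

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n} n^2)
    Frames beyond either end are replaced by the nearest edge frame.
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out
```

In a pipeline of this kind, each segment's MFCC frames and their deltas would typically be concatenated per frame before being passed to the classifier; whether the paper does exactly this is not stated in the abstract.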
URL
https://arxiv.org/abs/2404.15168