Abstract
Automatic speech recognition (ASR) systems play a key role in applications involving human-machine interaction. Despite their importance, ASR models proposed for the Portuguese language in the last decade have been limited in their ability to correctly identify punctuation marks in automatic transcriptions, which hinders the use of those transcriptions by other systems, models, and even humans. Recently, however, OpenAI proposed Whisper ASR, a general-purpose speech recognition model that has raised high expectations for addressing these limitations. This chapter presents the first study on the performance of Whisper for punctuation prediction in the Portuguese language. We present an experimental evaluation that considers both theoretical aspects, involving pausing points (comma) and complete ideas (exclamation mark, question mark, and full stop), and practical aspects, involving transcript-based topic modeling, an application that depends on punctuation marks to perform well. We analyzed experimental results from videos of Museum of the Person, a virtual museum that aims to tell and preserve people's life histories, and we discuss the pros and cons of Whisper in this real-world scenario. Although our experiments indicate that Whisper achieves state-of-the-art results, we conclude that the prediction of some punctuation marks, such as the exclamation mark, semicolon, and colon, still requires improvement.
URL
https://arxiv.org/abs/2305.14580