Abstract
Speechreading broadly involves looking at, perceiving, and interpreting spoken symbols. It has a wide range of multimedia applications, including surveillance, Internet telephony, and assistance for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recent research has ventured into generating audio speech from silent video sequences, but there has been no work on using multiple cameras for speech generation. To this end, this paper presents what it describes as the first multi-view speechreading and reconstruction system. The work extends the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple views in building an efficient speechreading and reconstruction system. They further show the camera placement that leads to maximum speech intelligibility. Finally, the paper lays out several applications of the proposed system, focusing on its potential impact not only in the security arena but also in many other multimedia analytics problems.
URL
https://arxiv.org/abs/1807.00619