Paper Reading AI Learner

Speech Reconstruction using Multi-view Silent Videos

2018-07-02 12:16:55
Yaman Kumar, Mayank Aggarwal, Pratham Nawal, Shin'ichi Satoh, Rajiv Ratn Shah, Roger Zimmermann

Abstract

Speechreading broadly involves looking at, perceiving, and interpreting spoken symbols. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has ventured into generating (audio) speech from silent video sequences, but there have been no developments in using multiple cameras for speech generation. To this end, this paper presents the world's first multi-view speechreading and reconstruction system. The work pushes the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple views in building an efficient speechreading and reconstruction system. The paper further identifies the camera placement that leads to maximum speech intelligibility. Finally, it lays out various innovative applications for the proposed system, focusing on its potentially prodigious impact not just in the security arena but in many other multimedia analytics problems.
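
No implementation details accompany this listing, so the sketch below is purely illustrative: a minimal multi-view video-to-speech model in PyTorch that encodes each camera's lip-region clip with a shared 3D-CNN, fuses the views by concatenating per-frame features, and regresses mel-spectrogram frames with a GRU decoder. The architecture, layer sizes, and the MultiViewSpeechReconstructor name are assumptions made for illustration, not the authors' model.

# Hypothetical sketch of a multi-view video-to-speech model (NOT the authors' architecture).
# Each camera view is encoded by a shared 3D-CNN, per-view features are fused by
# concatenation, and a GRU regresses a sequence of acoustic features (mel bins).
import torch
import torch.nn as nn


class MultiViewSpeechReconstructor(nn.Module):
    def __init__(self, num_views=3, feat_dim=128, mel_bins=80):
        super().__init__()
        # Shared per-view encoder over (channels, frames, height, width) lip-region clips.
        self.view_encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
        )
        self.proj = nn.Linear(64, feat_dim)
        # Fuse the views by concatenating their per-frame features.
        self.decoder = nn.GRU(num_views * feat_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_bins)

    def forward(self, views):
        # views: list of tensors, each (batch, 1, frames, H, W), one per camera.
        per_view = []
        for v in views:
            f = self.view_encoder(v)          # (batch, 64, frames, 1, 1)
            f = f.squeeze(-1).squeeze(-1)     # (batch, 64, frames)
            f = self.proj(f.transpose(1, 2))  # (batch, frames, feat_dim)
            per_view.append(f)
        fused = torch.cat(per_view, dim=-1)   # (batch, frames, num_views * feat_dim)
        hidden, _ = self.decoder(fused)
        return self.to_mel(hidden)            # (batch, frames, mel_bins)


if __name__ == "__main__":
    model = MultiViewSpeechReconstructor(num_views=2)
    clips = [torch.randn(1, 1, 30, 64, 64) for _ in range(2)]  # two synthetic camera views
    print(model(clips).shape)  # torch.Size([1, 30, 80])

Concatenation is only the simplest fusion choice; an attention weighting over views would be a natural alternative if one camera angle (say, frontal versus profile) proves more informative, which is the kind of question the paper's camera-placement study addresses.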

URL

https://arxiv.org/abs/1807.00619

PDF

https://arxiv.org/pdf/1807.00619.pdf

