Abstract
The most common way for humans to communicate is by speech. But perhaps a language system cannot know what it is communicating without a connection to the real world through image perception. In fact, humans perceive these multiple sources of information together to build a general concept. However, constructing a machine that can handle these modalities together in a supervised learning fashion is difficult, because it requires a parallel dataset spanning speech, image, and text, which is often unavailable. A machine speech chain based on sequence-to-sequence deep learning was previously proposed to achieve semi-supervised learning, enabling automatic speech recognition (ASR) and text-to-speech synthesis (TTS) to teach each other when they receive unpaired data. In this research, we take a further step by expanding the speech chain into a multimodal chain and design a closely knit architecture that connects ASR, TTS, image captioning (IC), and image retrieval (IR) models into a single framework. The ASR, TTS, IC, and IR components can be trained in a semi-supervised fashion by assisting each other given incomplete datasets and by leveraging cross-modal data augmentation within the chain.
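To make the closed-loop idea concrete, the sketch below illustrates how chain components might teach each other on unpaired data: ASR transcribes unpaired speech so TTS can try to reconstruct it, TTS synthesizes unpaired text so ASR can try to recover it, and an unpaired image is routed through IC and TTS to generate extra training signal for ASR. This is only an illustration assembled from the abstract's description, not the authors' code; the class names, loss functions, and update calls are hypothetical placeholders.

```python
# A minimal sketch (not the authors' implementation) of semi-supervised
# chain training on unpaired data. All model classes, method names, and
# losses are hypothetical stand-ins.

class Component:
    """Stand-in for a trainable seq2seq model (ASR, TTS, or IC)."""
    def __init__(self, name):
        self.name = name

    def __call__(self, x):
        return f"{self.name}({x})"  # placeholder forward pass

    def update(self, loss):
        pass                        # placeholder gradient step


def reconstruction_loss(original, reconstructed):
    return 0.0                      # placeholder loss


asr, tts, ic = Component("ASR"), Component("TTS"), Component("IC")


def speech_chain_step(unpaired_speech=None, unpaired_text=None):
    """Closed loop: each model pseudo-labels data for the other."""
    if unpaired_speech is not None:
        hyp_text = asr(unpaired_speech)        # speech -> pseudo text
        rec_speech = tts(hyp_text)             # pseudo text -> speech
        tts.update(reconstruction_loss(unpaired_speech, rec_speech))
    if unpaired_text is not None:
        syn_speech = tts(unpaired_text)        # text -> pseudo speech
        hyp_text = asr(syn_speech)             # pseudo speech -> text
        asr.update(reconstruction_loss(unpaired_text, hyp_text))


def multimodal_chain_step(unpaired_image):
    """Cross-modal augmentation: an image-only sample still trains ASR."""
    caption = ic(unpaired_image)               # image -> pseudo caption
    speech = tts(caption)                      # caption -> pseudo speech
    hyp = asr(speech)                          # pseudo speech -> text
    asr.update(reconstruction_loss(caption, hyp))


speech_chain_step(unpaired_speech="utt_001.wav")
multimodal_chain_step("img_042.jpg")
```

In this reading, a sample that is missing one or two modalities is completed by the other components in the chain, so every incomplete example can still contribute a training signal; the paper's actual architecture and losses are specified in the linked manuscript.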
URL
https://arxiv.org/abs/1906.00579