Paper Reading AI Learner

Strong and Simple Baselines for Multimodal Utterance Embeddings

2019-05-14 13:44:37
Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, Louis-Philippe Morency

Abstract

Human language is a rich multimodal signal consisting of spoken words, facial expressions, body gestures, and vocal intonations. Learning representations for these spoken utterances is a complex research problem due to the presence of multiple heterogeneous sources of information. Recent advances in multimodal learning have followed the general trend of building more complex models that utilize various attention, memory and recurrent components. In this paper, we propose two simple but strong baselines to learn embeddings of multimodal utterances. The first baseline assumes a conditional factorization of the utterance into unimodal factors. Each unimodal factor is modeled using the simple form of a likelihood function obtained via a linear transformation of the embedding. We show that the optimal embedding can be derived in closed form by taking a weighted average of the unimodal features. In order to capture richer representations, our second baseline extends the first by factorizing into unimodal, bimodal, and trimodal factors, while retaining simplicity and efficiency during learning and inference. From a set of experiments across two tasks, we show strong performance on both supervised and semi-supervised multimodal prediction, as well as significant (10 times) speedups over neural models during inference. Overall, we believe that our strong baseline models offer new benchmarking options for future research in multimodal learning.

Abstract (translated)

人类语言是一种由口语、面部表情、身体姿势和声调组成的丰富的多模态信号。由于存在多种不同的信息源,对这些口语表达的学习表征是一个复杂的研究问题。多模学习的最新进展遵循了利用各种注意、记忆和重复成分建立更复杂模型的一般趋势。在本文中,我们提出了两个简单但强有力的基线来学习多模态话语的嵌入。第一个基线假设将话语分解为单峰因子。每一个单峰因子都是通过嵌入的线性变换得到的一个简单的似然函数形式来建模的。结果表明,采用单峰特征的加权平均,可以得到最优嵌入的封闭形式。为了获得更丰富的表示,我们的第二个基线将第一个基线分解为单峰、双峰和三峰因子,同时在学习和推理过程中保持简单和效率。通过对两个任务的一组实验,我们在有监督和半监督的多模态预测上都表现出了很强的性能,并且在推理过程中,神经模型的速度明显加快了(10倍)。总的来说,我们相信我们强大的基线模型为未来多模学习研究提供了新的基准选择。

URL

https://arxiv.org/abs/1906.02125

PDF

https://arxiv.org/pdf/1906.02125.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot