Emoticons are symbolic representations that generally accompany textual content to visually enhance or summarize the true intention of a written message. Although widely used on social media, the core semantics of emoticons have not been extensively explored across multiple modalities. Combining textual and visual information within a single message offers a richer way of conveying information. This research therefore analyzes the relationship among sentences, visuals, and emoticons. The paper first provides a detailed examination of techniques for extracting multimodal features, emphasizing the pros and cons of each method. Building on a comprehensive examination of several multimodal algorithms, with particular emphasis on fusion approaches, we propose a novel contrastive-learning-based multimodal architecture. The proposed model jointly trains a dual-branch encoder with a contrastive objective to accurately map text and images into a common latent space. Our key finding is that integrating contrastive learning with the other two branches yields superior results. The experimental results demonstrate that the proposed method surpasses existing multimodal approaches in accuracy and robustness. The model attains an accuracy of 91% and an MCC score of 90% when assessing emoticons on the Multimodal-Twitter Emoticon dataset acquired from Twitter. We provide evidence that the deep features learned through contrastive learning are more effective, suggesting that the proposed fusion technique also generalises well for recognising emoticons across multiple modalities.
https://arxiv.org/abs/2408.02571
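As a rough illustration of the dual-branch contrastive alignment described in the preceding abstract, the sketch below implements a generic CLIP-style symmetric contrastive loss that pulls matched text/image embeddings together in a shared latent space. The function and parameter names (contrastive_alignment_loss, temperature) are illustrative assumptions, not the paper's exact objective.

```python
# Minimal sketch of a dual-branch contrastive alignment (generic formulation).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """text_emb, image_emb: (batch, embed_dim) outputs of the two encoder branches."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature          # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched text/image pairs sit on the diagonal; pull them together and push
    # mismatched pairs apart, symmetrically in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# usage: loss = contrastive_alignment_loss(text_encoder(x_txt), image_encoder(x_img))
```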
This paper introduces a multi-modal neural network model for resolution enhancement that leverages inter-diagnostic correlations within a system. Traditional approaches have focused primarily on uni-modal enhancement strategies, such as pixel-based image enhancement or heuristic signal interpolation. In contrast, our model harnesses the diagnostic relationships within the physics of fusion plasmas. We first establish the correlations among diagnostics within the tokamak. We then use these correlations to substantially enhance the temporal resolution of the Thomson Scattering diagnostic, which measures plasma density and temperature. By increasing its resolution from the conventional 200 Hz to 500 kHz, we enable a level of insight into plasma behavior previously attainable only through computationally intensive simulations. This enhancement goes beyond simple interpolation, offering novel perspectives on the physical phenomena governing plasma dynamics.
https://arxiv.org/abs/2405.05908
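The resolution-enhancement idea in the abstract above can be pictured as a regression from fast, correlated diagnostics to the slow Thomson Scattering measurements; the sketch below shows only that general pattern. The choice of input diagnostic, channel counts, and network sizes are assumptions, not details from the paper.

```python
# Rough sketch: learn a mapping from fast, correlated diagnostics (e.g., magnetics
# sampled at hundreds of kHz) to the slow Thomson Scattering measurements, then
# apply it between TS samples to upsample the density/temperature time series.
import torch
import torch.nn as nn

class DiagnosticMapper(nn.Module):
    def __init__(self, n_fast_channels=32, n_ts_outputs=2):   # density, temperature
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_fast_channels, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_ts_outputs),
        )

    def forward(self, fast_signals):          # (batch, n_fast_channels)
        return self.net(fast_signals)         # predicted TS values at that instant

# Train on time points where TS measurements exist (200 Hz), then evaluate on
# every fast-diagnostic sample (500 kHz) to obtain the upsampled signal.
```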
This report provides a detailed description of the method we explored and proposed for the WECIA Emotion Prediction Competition (EPC), which asks participants to predict a person's emotion from an artistic work paired with a comment. The competition dataset is ArtELingo, designed to encourage work on diversity across languages and cultures. The dataset poses two main challenges: the modal-imbalance problem and the language-cultural-differences problem. To address these, we propose a simple yet effective approach, single-multi modal with Emotion-Cultural specific prompt (ECSP), which uses single-modal messages to enhance the performance of multimodal models and a well-designed prompt to reduce the cultural-differences problem. Our approach contains two main blocks: (1) an XLM-R\cite{conneau2019unsupervised} based unimodal model and an X$^2$-VLM\cite{zeng2022x} based multimodal model, and (2) an Emotion-Cultural specific prompt. Our approach ranked first in the final test with a score of 0.627.
https://arxiv.org/abs/2403.17683
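To make the "Emotion-Cultural specific prompt" idea concrete, here is a hypothetical prompt-construction helper that conditions on the comment's language and cultural background; the template wording and the emotion label list (modeled on the ArtEmis-style label set) are assumptions, not the authors' exact prompt.

```python
# Hypothetical ECSP-style prompt builder (illustrative template, not the paper's).
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness", "something else"]

def build_ecsp_prompt(comment: str, language: str) -> str:
    return (
        f"The following comment on an artwork was written in {language}. "
        f"Taking the cultural background of a {language} speaker into account, "
        f"which emotion does it express? Options: {', '.join(EMOTIONS)}.\n"
        f"Comment: {comment}"
    )

print(build_ecsp_prompt("The colors feel like a festival at dusk.", "Arabic"))
```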
In the rapidly evolving field of machine learning (ML), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without additional data collection. This survey explores the transformative impact of Large Language Models (LLMs) on DA, particularly the unique challenges and opportunities they present in natural language processing (NLP) and beyond. From a data perspective and a learning perspective, we examine strategies that use LLMs for data augmentation, including a novel exploration of learning paradigms in which LLM-generated data is used for further training. We also delineate the primary challenges in this domain, ranging from controllable data augmentation to multimodal data augmentation. The survey highlights the paradigm shift that LLMs introduce in DA and aims to serve as a foundational guide for researchers and practitioners in the field.
https://arxiv.org/abs/2403.02990
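A minimal, hedged sketch of the "data perspective" strategies surveyed above: paraphrasing labeled examples with an LLM to diversify training data. The llm_generate function is a hypothetical placeholder for whatever LLM client is available, not a real API.

```python
# LLM-based paraphrase augmentation (sketch; plug in an actual LLM client).
def llm_generate(prompt: str) -> str:
    raise NotImplementedError("hypothetical placeholder for an LLM call")

def augment_with_paraphrases(texts, labels, n_variants=2):
    """Return the original (text, label) pairs plus LLM paraphrases with the same label."""
    augmented = list(zip(texts, labels))
    for text, label in zip(texts, labels):
        for _ in range(n_variants):
            prompt = ("Paraphrase the sentence below, keeping its meaning and "
                      f"sentiment unchanged:\n{text}")
            augmented.append((llm_generate(prompt), label))
    return augmented
```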
This paper presents a multimodal deep learning framework for enhanced agricultural pest detection, combining tiny-BERT's natural language processing with R-CNN and ResNet-18 image processing. Addressing the limitations of traditional CNN-based visual methods, the approach integrates textual context for more accurate pest identification. The R-CNN and ResNet-18 integration tackles deep-CNN issues such as vanishing gradients, while tiny-BERT ensures computational efficiency. Employing ensemble learning with linear regression and random forest models, the framework demonstrates superior discriminative ability, as shown in ROC and AUC analyses. This multimodal approach, blending text and image data, significantly boosts pest detection in agriculture. The study highlights the potential of multimodal deep learning in complex real-world scenarios and suggests future work on more diverse datasets, advanced data augmentation, and cross-modal attention mechanisms to enhance model performance.
https://arxiv.org/abs/2312.10948
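The late-fusion ensemble described above can be approximated as follows: concatenate text and image feature vectors, train a linear model and a random forest, and average their scores, evaluating with AUC. Logistic regression stands in for the abstract's "linear regression" because the target here is a class label; all feature shapes and the random placeholder data are assumptions.

```python
# Sketch of a text+image feature ensemble with a linear model and a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(500, 128))     # placeholder tiny-BERT embeddings
img_feats = rng.normal(size=(500, 512))      # placeholder ResNet-18 embeddings
y = rng.integers(0, 2, size=500)             # pest present / absent

X = np.hstack([text_feats, img_feats])       # simple early fusion by concatenation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Score-averaging ensemble, evaluated with AUC as in the abstract.
scores = 0.5 * (linear.predict_proba(X_te)[:, 1] + forest.predict_proba(X_te)[:, 1])
print("ensemble AUC:", roc_auc_score(y_te, scores))
```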
Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be used interchangeably for predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility is difficult because of the inherent challenges in accurately interpreting and integrating varied data sources, and because the system must robustly handle missing or partial information while allowing a direct switch between regression and classification tasks. This study proposes a \emph{versatile audio-visual learning} (VAVL) framework that handles unimodal and multimodal systems for emotion regression and emotion classification tasks. Our audio-visual framework can be trained even when paired audio-visual data is not available for part of the training set (i.e., only audio or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over the shared layers, and a unimodal reconstruction task. Our experimental results show that the architecture significantly outperforms strong baselines on both the CREMA-D and MSP-IMPROV corpora. Notably, VAVL attains a new state-of-the-art performance on the emotional attribute prediction task of the MSP-IMPROV corpus. Code available at: this https URL
https://arxiv.org/abs/2305.07216
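A conceptual sketch (not the authors' code) of the three ingredients VAVL names: audio-visual shared layers, residual connections over them, and a unimodal reconstruction task, arranged so the forward pass still works when one modality is missing. The dimensions and the three-attribute prediction head are assumptions.

```python
# Sketch of shared layers + residual connections + unimodal reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAVBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.audio_enc = nn.Linear(128, dim)    # placeholder unimodal encoders
        self.video_enc = nn.Linear(512, dim)
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.recon_audio = nn.Linear(dim, 128)  # unimodal reconstruction heads
        self.recon_video = nn.Linear(dim, 512)
        self.head = nn.Linear(dim, 3)           # e.g., arousal/valence/dominance

    def forward(self, audio=None, video=None):
        feats, recon_loss = [], 0.0
        if audio is not None:
            h = self.audio_enc(audio)
            h = h + self.shared(h)                                   # residual over shared layers
            recon_loss = recon_loss + F.mse_loss(self.recon_audio(h), audio)
            feats.append(h)
        if video is not None:
            h = self.video_enc(video)
            h = h + self.shared(h)
            recon_loss = recon_loss + F.mse_loss(self.recon_video(h), video)
            feats.append(h)
        fused = torch.stack(feats).mean(dim=0)                       # average whatever is present
        return self.head(fused), recon_loss
```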
In the past ten years, with the help of deep learning, and especially the rapid development of deep neural networks, medical image analysis has made remarkable progress. However, how to effectively use the relational information between tissues or organs in medical images remains a challenging and under-studied problem. In this thesis, we propose two novel solutions to this problem based on deep relational learning. First, we propose a context-aware fully convolutional network that effectively models implicit relational information between features to perform medical image segmentation. The network achieves state-of-the-art segmentation results on the Multimodal Brain Tumor Segmentation 2017 (BraTS2017) and 2018 (BraTS2018) datasets. Second, we propose a hierarchical homography estimation network that achieves accurate medical image mosaicing by learning the explicit spatial relationship between adjacent frames. In experiments on the UCL Fetoscopy Placenta dataset, our hierarchical homography estimation network outperforms other state-of-the-art mosaicing methods while producing robust and meaningful mosaics on unseen frames.
https://arxiv.org/abs/2303.16099
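For the mosaicing half of the thesis, the estimated pairwise homographies must be chained into a common reference frame; the small sketch below shows that standard accumulation step (warping utilities such as cv2.warpPerspective are assumed to exist elsewhere, and the pairwise estimates come from whatever homography model is used).

```python
# Chain pairwise homographies so every frame maps into frame 0's coordinates.
import numpy as np

def accumulate_homographies(pairwise_H):
    """pairwise_H[i] maps frame i+1 into frame i; returns maps into frame 0."""
    H_to_ref = [np.eye(3)]
    for H in pairwise_H:
        H_to_ref.append(H_to_ref[-1] @ H)
    return H_to_ref

# usage: H_list = accumulate_homographies(model_estimates)
#        then warp frame k with H_list[k] (e.g., cv2.warpPerspective) to build the mosaic.
```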
Falls have become more frequent in recent years and are particularly harmful to senior citizens. Detecting falls has therefore become important, and several datasets and machine learning models have been introduced for fall detection. In this project report, a human fall detection method using a multimodal approach is proposed. We use the UP-FALL detection dataset, which was collected from dozens of volunteers using different sensors and two cameras. We use the wrist-worn accelerometer data and reduce the labels to a binary classification, fall versus no fall. We also fuse camera and sensor data to increase performance. The experimental results show that, for binary classification, using only wrist data instead of multiple sensors did not reduce the model's fall-detection performance.
https://arxiv.org/abs/2302.00224
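A minimal sketch of the wrist-only baseline discussed above: window the 3-axis accelerometer stream, compute simple magnitude statistics per window, and fit a binary fall / no-fall classifier. Window length, features, and the random placeholder data are assumptions.

```python
# Windowed accelerometer features for binary fall detection (sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(acc, win=100):
    """acc: (n_samples, 3) accelerometer signal -> one feature row per window."""
    feats = []
    for start in range(0, len(acc) - win + 1, win):
        mag = np.linalg.norm(acc[start:start + win], axis=1)   # acceleration magnitude
        feats.append([mag.mean(), mag.std(), mag.max(), mag.min()])
    return np.array(feats)

rng = np.random.default_rng(0)
X = window_features(rng.normal(size=(10000, 3)))               # placeholder signal
y = rng.integers(0, 2, size=len(X))                            # fall / no fall labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```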
Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis text can be inferred from a given premise image; it is one of the recently emerged vision-and-language understanding tasks. Most existing VE approaches are derived from visual question answering methods: they recognize visual entailment by quantifying the similarity between the hypothesis and the premise in content semantic features across modalities. Such approaches, however, ignore VE's unique nature of relation inference between the premise and the hypothesis. In this paper, we therefore propose a new architecture, AlignVE, that solves the visual entailment problem with a relation-interaction method. It models the relation between the premise and the hypothesis as an alignment matrix, applies a pooling operation to obtain fixed-size feature vectors, and finally passes them through fully-connected and normalization layers to complete the classification. Experiments show that our alignment-based architecture reaches 72.45\% accuracy on the SNLI-VE dataset, outperforming previous content-based models under the same settings.
https://arxiv.org/abs/2211.08736
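The alignment-based pipeline described in the AlignVE abstract (an alignment matrix between premise and hypothesis features, pooling to a fixed size, then fully-connected and normalization layers) could be sketched roughly as follows; all dimensions and the specific pooling choice are illustrative assumptions.

```python
# Alignment matrix -> fixed-size pooling -> FC + normalization classifier (sketch).
import torch
import torch.nn as nn

class AlignmentClassifier(nn.Module):
    def __init__(self, pooled=64, n_classes=3):      # entailment / neutral / contradiction
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((pooled, 1))
        self.classifier = nn.Sequential(nn.Linear(pooled, 128), nn.ReLU(),
                                        nn.LayerNorm(128), nn.Linear(128, n_classes))

    def forward(self, image_regions, text_tokens):
        # (batch, R, dim) x (batch, T, dim) -> (batch, R, T) alignment/relation matrix
        align = torch.bmm(image_regions, text_tokens.transpose(1, 2))
        pooled = self.pool(align.unsqueeze(1)).flatten(1)   # fixed-size feature vector
        return self.classifier(pooled)
```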
Food is not only a basic human necessity but also a key factor driving a society's health and economic well-being. As a result, the cooking domain is a popular use case for demonstrating decision-support (AI) capabilities in service of benefits like precision health, with tools ranging from information retrieval interfaces to task-oriented chatbots. An AI here should understand concepts in the food domain (e.g., recipes, ingredients), be tolerant of failures encountered while cooking (e.g., browning of butter), handle allergy-based substitutions, and work with multiple data modalities (e.g., text and images). However, recipes today are handled as textual documents, which makes it difficult for machines to read them, reason over them, and handle ambiguity. This calls for a better representation of recipes, one that overcomes the ambiguity and sparseness of current textual documents. In this paper, we discuss the construction of a machine-understandable rich recipe representation (R3), in the form of plans, from recipes available in natural language. R3 is infused with additional knowledge such as information about allergens, images of ingredients, possible failures, and tips for each atomic cooking step. To show the benefits of R3, we also present TREAT, a recipe-retrieval tool that uses R3 to perform multi-modal reasoning over a recipe's content (plan objects: ingredients and cooking tools), food preparation process (plan actions and time), and media type (image, text). R3 leads to improved retrieval efficiency and new capabilities that were hitherto not possible with textual representations.
https://arxiv.org/abs/2203.17109
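To illustrate what a machine-understandable R3-style plan step might contain, here is a hypothetical data structure; the field names are illustrative, not the paper's exact schema.

```python
# Hypothetical plan-step record in the spirit of R3.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CookingStep:
    action: str                                           # e.g., "brown"
    ingredients: List[str]                                # e.g., ["butter"]
    tools: List[str] = field(default_factory=list)
    duration_min: float = 0.0
    allergens: List[str] = field(default_factory=list)    # e.g., ["dairy"]
    image_url: str = ""                                   # picture of the ingredient/state
    possible_failures: List[str] = field(default_factory=list)
    tips: List[str] = field(default_factory=list)

step = CookingStep(action="brown", ingredients=["butter"], tools=["saucepan"],
                   duration_min=3, allergens=["dairy"],
                   possible_failures=["butter burns if heat is too high"],
                   tips=["stir constantly once it foams"])
```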
Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human-populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially compliant navigation behaviors on these robots becomes critical to ensuring safe and comfortable human-robot coexistence. To address this challenge, imitation learning is a promising framework, since it is easier for humans to demonstrate the task of social navigation than to formulate reward functions that accurately capture its complex multi-objective setting. The use of imitation learning and inverse reinforcement learning for social navigation on mobile robots, however, is currently hindered by a lack of large-scale datasets that capture socially compliant robot navigation demonstrations in the wild. To fill this gap, we introduce the Socially CompliAnt Navigation Dataset (SCAND), a large-scale, first-person-view dataset of socially compliant navigation demonstrations. Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations comprising multimodal data streams, including 3D lidar, joystick commands, odometry, and visual and inertial information, collected on two morphologically different mobile robots (a Boston Dynamics Spot and a Clearpath Jackal) by four different human demonstrators in both indoor and outdoor environments. We additionally perform preliminary analysis and validation through real-world robot experiments and show that navigation policies learned by imitation learning on SCAND generate socially compliant behaviors.
https://arxiv.org/abs/2203.15041
In this paper, we propose to build a stylish image captioning model through a Multi-style Multi-modality mechanism (2M). We demonstrate that with 2M we can build an effective stylish captioner, and that the multi-references produced by the model can also support explaining the model by identifying erroneous input features on faulty examples. We show how the 2M mechanism can be used both to build stylish captioning models and to provide explanations of likely errors in those models.
https://arxiv.org/abs/2110.10704
With the increase in computation power and the development of new state-of-the-art deep learning algorithms, appearance-based gaze estimation is becoming more and more popular. It is believed to work well on curated laboratory datasets; however, it faces several challenges when deployed in real-world scenarios. One such challenge is estimating the gaze of a person about whom the deep learning model trained for gaze estimation has no prior knowledge. To analyse performance in such scenarios, we simulate a calibration mechanism. In this work we use the MPIIGaze dataset. We train a multimodal convolutional neural network and analyse its performance with and without calibration; this evaluation provides clear insights into how calibration improves the performance of the model in estimating gaze in the wild.
https://arxiv.org/abs/2109.12801
Sepsis is a life-threatening disease with high morbidity, mortality, and healthcare costs. The early prediction and administration of antibiotics and intravenous fluids is considered crucial for the treatment of sepsis and can potentially save millions of lives and billions in healthcare costs. Clinical practitioners have proposed clinical criteria that aid in the early detection of sepsis; however, the performance of these criteria is often limited. Clinical text provides essential information for estimating the severity of sepsis in addition to structured clinical data. In this study, we explore how clinical text can complement structured data for the early sepsis prediction task. We propose a multimodal model that incorporates both structured data, in the form of patient measurements, and textual notes on the patient. We employ state-of-the-art NLP models such as BERT and a highly specialized NLP model, Amazon Comprehend Medical, to represent the text. On the MIMIC-III dataset of ICU admissions, we show that using these notes yields an improvement of 6.07 points in a standard utility score for sepsis prediction and 2.89% in AUROC. Our method significantly outperforms qSOFA, a clinical criterion suggested by experts, as well as the winning model of the PhysioNet Computing in Cardiology Challenge for predicting sepsis.
https://arxiv.org/abs/2107.11094
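A rough sketch of the multimodal fusion described above: a clinical-note embedding (e.g., a BERT [CLS] vector computed offline) is concatenated with structured patient measurements and fed to a small classifier. All dimensions and layer sizes are assumptions, not the paper's configuration.

```python
# Fuse a precomputed note embedding with structured measurements (sketch).
import torch
import torch.nn as nn

class SepsisFusionModel(nn.Module):
    def __init__(self, text_dim=768, struct_dim=40):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 64), nn.ReLU())
        self.struct_proj = nn.Sequential(nn.Linear(struct_dim, 64), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                        nn.Linear(64, 1))

    def forward(self, note_embedding, measurements):
        fused = torch.cat([self.text_proj(note_embedding),
                           self.struct_proj(measurements)], dim=-1)
        return torch.sigmoid(self.classifier(fused))   # probability of sepsis onset
```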
Emotion recognition is an important research field for Human-Computer Interaction (HCI). Audio-Video Emotion Recognition (AVER) is now tackled with Deep Neural Network (DNN) modeling tools. Published papers, as a rule, show only cases where multiple modalities are superior to audio-only or video-only modalities; however, cases where a single modality is superior can also be found. In our research, we hypothesize that for fuzzy categories of emotional events, the higher noise of one modality can amplify the lower noise of the second modality, as represented indirectly in the parameters of the modeling neural network. To avoid such cross-modal information interference, we define a multi-modal Residual Perceptron Network (MRPN) which learns from multi-modal network branches, creating a deep feature representation with reduced noise. With the proposed MRPN model and a novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate improves to 91.4% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and to 83.15% on the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Moreover, the MRPN concept shows its potential for multi-modal classifiers dealing with signal sources beyond the optical and acoustical types.
https://arxiv.org/abs/2107.10742
Focus-based methods have shown promising results for the task of depth estimation. However, most existing focus-based depth estimation approaches depend on the maximal sharpness of the focal stack, and out-of-focus information in the stack poses challenges for this task. In this paper, we propose a dynamic multimodal learning strategy that incorporates RGB data and the focal stack in one framework. Our goal is to deeply excavate the spatial correlation in the focal stack with a spatial correlation perception module and to dynamically fuse the RGB data and the focal stack in an adaptive way with a multimodal dynamic fusion module. The success of our method is demonstrated by state-of-the-art performance on two datasets. Furthermore, we test our network on a set of differently focused images captured with a smartphone camera, showing that the proposed method not only breaks the limitation of relying solely on light field data but also opens a path toward practical depth estimation on common consumer-level camera data.
https://arxiv.org/abs/2104.05969
In this paper, we present our approach for the IEEE BigMM 2020 Grand Challenge (BMGC): identifying sentiments from tweets related to the MeToo movement. The model is based on an ensemble of a Convolutional Neural Network, a Bidirectional LSTM, and a DNN for final classification. This paper provides a detailed analysis of the model and the results obtained. We ranked 5th out of 10 teams with a score of 0.51491.
https://arxiv.org/abs/2104.05331
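The ensemble named above (a CNN and a Bidirectional LSTM with a DNN making the final prediction) could look roughly like the following single-model sketch operating on token ids; the vocabulary size, dimensions, and class count are placeholders, not the authors' configuration.

```python
# CNN branch + BiLSTM branch over shared embeddings, fused by a small DNN (sketch).
import torch
import torch.nn as nn

class CnnBiLstmEnsemble(nn.Module):
    def __init__(self, vocab=20000, emb=100, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb, 64, bidirectional=True, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(64 + 128, 64), nn.ReLU(),
                                 nn.Linear(64, n_classes))

    def forward(self, tokens):                                   # tokens: (B, T) int ids
        e = self.embed(tokens)                                   # (B, T, emb)
        c = torch.relu(self.conv(e.transpose(1, 2))).max(dim=2).values   # CNN branch: (B, 64)
        h, _ = self.lstm(e)                                      # BiLSTM branch: (B, T, 128)
        r = h.mean(dim=1)                                        # (B, 128)
        return self.dnn(torch.cat([c, r], dim=1))                # final DNN classifier
```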
We address the problem of estimating depth with multimodal audio-visual data. Inspired by the ability of animals such as bats and dolphins to infer the distance of objects through echolocation, some recent methods have utilized echoes for depth estimation. We propose an end-to-end deep-learning pipeline that utilizes RGB images, binaural echoes, and estimated material properties of the various objects within a scene. We argue that the relation between image, echoes, and depth for different scene elements is greatly influenced by the properties of those elements, and that a method designed to leverage this information can significantly improve depth estimation from audio-visual inputs. We propose a novel multimodal fusion technique that incorporates material properties explicitly while combining the audio (echo) and visual modalities to predict scene depth. We show empirically, with experiments on the Replica dataset, that the proposed method obtains a 28% improvement in RMSE over the state-of-the-art audio-visual depth prediction method. To demonstrate the effectiveness of our method on a larger dataset, we report competitive performance on Matterport3D, proposing for the first time to use it as a multimodal depth prediction benchmark with echoes. We also analyse the proposed method with exhaustive ablation experiments and qualitative results. The code and models are available at this https URL
https://arxiv.org/abs/2103.08468
In this paper, we primarily explore the improvement of single-stream audio systems using angle-of-arrival calculations on both simulated and real-life data. We wanted to learn how to discern the direction of an audio source from gathered signal data, with the goal of eventually incorporating this into a multimodal security system. We focused on the MUSIC algorithm for angle-of-arrival estimation, but also briefly experimented with other techniques such as the Bartlett and Capon methods. We implemented our own MUSIC algorithm on simulated data from Cornell. In addition, we demonstrated how to calculate the angle of arrival over time in a real-life scene, and we were able to detect the directions of arrival of two separate, simultaneous audio sources in a real-life scene. Eventually, this tracking could be incorporated into a multimodal system combined with video. Overall, we produce compelling results for angle-of-arrival calculations that could be stepping stones toward a better system for detecting events in a scene.
https://arxiv.org/abs/2101.09904
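Since the abstract centers on the MUSIC algorithm, here is a compact numpy sketch of the standard MUSIC pseudospectrum for a uniform linear array; the array geometry, element spacing, and known source count are assumptions rather than details from the report.

```python
# Standard MUSIC direction-of-arrival pseudospectrum for a uniform linear array.
import numpy as np

def music_spectrum(X, n_sources, d=0.5, angles=np.linspace(-90, 90, 361)):
    """X: (n_sensors, n_snapshots) complex array data; d: element spacing in wavelengths."""
    n_sensors = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                    # spatial covariance matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # eigenvalues in ascending order
    En = eigvecs[:, : n_sensors - n_sources]           # noise subspace (smallest eigenvalues)
    spectrum = []
    for theta in np.deg2rad(angles):
        a = np.exp(-2j * np.pi * d * np.arange(n_sensors) * np.sin(theta))  # steering vector
        denom = a.conj() @ En @ En.conj().T @ a
        spectrum.append(1.0 / np.abs(denom))
    return angles, np.array(spectrum)                  # peaks give the estimated DoAs
```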
Speech-driven facial video generation is a complex problem due to its multi-modal nature, spanning the audio and video domains. The audio carries many underlying features, such as expression, pitch, loudness, and prosody (speaking style), while facial video exhibits substantial variability in head movement, eye blinks, lip synchronization, and movements of various facial action units, along with temporal smoothness. Synthesizing highly expressive facial videos from an audio input and a static image remains a challenging task for generative adversarial networks. In this paper, we propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length using as input an audio signal and a single image of a person. The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn the movements of expressive facial components and hence generates a highly expressive talking-head video of the given person. The multi-modal adaptive normalization uses various audio and video features, such as the Mel spectrogram, pitch, and energy from the audio signal and the predicted keypoint heatmap/optical flow and a single image, to learn the respective affine parameters that generate highly expressive video. Experimental evaluation demonstrates the superior performance of the proposed method compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [53], Speech2Vid [10], and other approaches on multiple quantitative metrics, including SSIM (structural similarity index), PSNR (peak signal-to-noise ratio), CPBD (image sharpness), WER (word error rate), blinks/sec, and LMD (landmark distance). Further, qualitative evaluation and online Turing tests demonstrate the efficacy of our approach.
https://arxiv.org/abs/2012.07304
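A generic sketch of an adaptive normalization layer in the spirit of the multi-modal adaptive normalization (MAN) described above: the affine parameters of a normalization layer are predicted from a conditioning vector of audio/motion features. The layer types and sizes are illustrative assumptions, not the paper's design.

```python
# Adaptive normalization whose scale/shift are predicted from multimodal features.
import torch
import torch.nn as nn

class MultiModalAdaptiveNorm(nn.Module):
    def __init__(self, n_channels, cond_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(n_channels, affine=False)
        self.to_gamma = nn.Linear(cond_dim, n_channels)
        self.to_beta = nn.Linear(cond_dim, n_channels)

    def forward(self, x, cond):
        # x: (B, C, H, W) image features; cond: (B, cond_dim) audio/motion conditioning
        gamma = self.to_gamma(cond).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(cond).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta
```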