Plant disease classification via imaging is a critical task in precision agriculture. We propose XMACNet, a novel lightweight Convolutional Neural Network (CNN) that integrates self-attention and multimodal fusion of visible imagery and vegetation indices for chili disease detection. XMACNet uses an EfficientNetV2S backbone enhanced by a self-attention module and a fusion branch that processes both RGB images and computed vegetation index maps (NDVI, NPCI, MCARI). We curated a new dataset of 12,000 chili leaf images across six classes (five disease types plus healthy), augmented synthetically via StyleGAN to mitigate data scarcity. Trained on this dataset, XMACNet achieves high accuracy, F1-score, and AUC, outperforming baseline models such as ResNet-50, MobileNetV2, and a Swin Transformer variant. Crucially, XMACNet is explainable: we use Grad-CAM++ and SHAP to visualize and quantify the model's focus on disease features. The model's compact size and fast inference make it suitable for edge deployment in real-world farming scenarios.
https://arxiv.org/abs/2603.06750
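The vegetation index branch above computes index maps from band reflectances. As an illustration of how one such map (NDVI) is derived, here is a minimal NumPy sketch; the `ndvi_map` helper and the toy reflectance values are ours, not from the paper, and real chili imagery would need a sensor with a near-infrared band.

```python
import numpy as np

def ndvi_map(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Healthy vegetation reflects strongly in NIR and absorbs red light,
    so values near +1 indicate dense healthy foliage.
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

# Toy 2x2 reflectance maps: top row vegetated, bottom row bare soil.
nir = np.array([[0.8, 0.7], [0.3, 0.3]])
red = np.array([[0.1, 0.1], [0.25, 0.3]])
print(np.round(ndvi_map(nir, red), 3))
```

The resulting map can simply be stacked with the RGB channels as an extra input plane for the fusion branch.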
The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Vision Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign-object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
https://arxiv.org/abs/2603.05230
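The benchmark above reports per-class accuracy and hallucination behavior (e.g., a model claiming to see a garment in an empty scene). A minimal sketch of how such metrics could be tallied; the class list and helper names are illustrative and not the paper's evaluation code.

```python
from collections import defaultdict

CLASSES = ["shirt", "sock", "trousers", "underwear", "foreign_object", "empty"]

def per_class_accuracy(records):
    """records: list of (true_label, predicted_label) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in records:
        total[truth] += 1
        if pred == truth:
            correct[truth] += 1
    return {c: correct[c] / total[c] for c in total}

def hallucination_rate(records):
    """Fraction of empty scenes where the model claimed to see an object."""
    empties = [(t, p) for t, p in records if t == "empty"]
    if not empties:
        return 0.0
    return sum(1 for t, p in empties if p != "empty") / len(empties)

records = [("shirt", "shirt"), ("sock", "shirt"), ("empty", "empty"), ("empty", "sock")]
print(per_class_accuracy(records))
print(hallucination_rate(records))
```

Per-class breakdowns matter here because a model can score well overall while systematically hallucinating garments in empty or foreign-object scenes.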
OmniLottie is a versatile framework that generates high-quality vector animations from multimodal instructions. For flexible control of motion and visual content, we focus on Lottie, a lightweight JSON format that represents both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. We therefore introduce a carefully designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision language models that follow multimodal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Extensive experiments validate that OmniLottie produces vivid and semantically aligned vector animations that adhere closely to multimodal human instructions.
https://arxiv.org/abs/2603.02138
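The tokenizer described above strips invariant metadata and flattens the remaining JSON into command/parameter tokens. The paper's actual token vocabulary is not specified in this abstract; the sketch below only illustrates the general idea on a toy Lottie-like file, with `SKIP_KEYS` chosen purely for illustration.

```python
import json

# Keys treated as invariant boilerplate and dropped (illustrative choice):
# version, metadata, layer names, match names.
SKIP_KEYS = {"v", "meta", "nm", "mn"}

def tokenize(node, tokens):
    """Flatten a Lottie-like JSON tree into a command/parameter token list."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in SKIP_KEYS:
                continue
            tokens.append(f"<{key}>")   # structural key becomes a command token
            tokenize(value, tokens)
    elif isinstance(node, list):
        for item in node:
            tokenize(item, tokens)
    else:
        tokens.append(str(node))        # leaf values become parameter tokens
    return tokens

lottie = json.loads('{"v": "5.7.1", "fr": 30, "layers": [{"ty": 4, "ks": {"p": [10, 20]}}]}')
print(tokenize(lottie, []))
```

A sequence like this is far shorter than the raw JSON text, which is what makes it tractable for a pretrained language model to predict.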
SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, in which one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multimodal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multimodal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and naturally extends to vision-referenced inpainting and editing via multimodal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multimodal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
https://arxiv.org/abs/2602.21818
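The channel-concatenation formulation mentioned above conditions the video branch by stacking the noisy latent, a masked reference latent, and the mask itself along the channel axis, so that image-to-video, extension, and editing differ only in which frames the mask selects. A minimal NumPy sketch under assumed latent shapes; `build_mmdit_input` and this exact layout are our illustration, not SkyReels' implementation.

```python
import numpy as np

def build_mmdit_input(noisy_latent, cond_latent, mask):
    """Channel-concatenation conditioning: stack the noisy latent, the
    masked conditioning latent, and the binary mask along channels.
    Shapes: (T, C, H, W) for the latents, (T, 1, H, W) for the mask."""
    return np.concatenate([noisy_latent, cond_latent * mask, mask], axis=1)

T, C, H, W = 4, 8, 16, 16
noisy = np.random.randn(T, C, H, W)
cond = np.random.randn(T, C, H, W)

# Image-to-video: only the first frame is conditioned on.
mask = np.zeros((T, 1, H, W))
mask[0] = 1.0

x = build_mmdit_input(noisy, cond, mask)
print(x.shape)  # (4, 17, 16, 16)
```

Video extension would instead set the mask on a prefix of frames, and editing on a spatial region, without changing the model interface.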
Deteriorating civil infrastructure requires automated inspection techniques that overcome the limitations of visual assessment. While Ground Penetrating Radar and Infrared Thermography enable subsurface defect detection, single-modal approaches face complementary constraints: radar struggles with moisture and shallow defects, while thermography exhibits weather dependency and limited depth. This paper presents a multimodal attention network that fuses radar temporal patterns with thermal spatial signatures for bridge deck delamination detection. Our architecture introduces temporal attention for radar processing, spatial attention for thermal features, and cross-modal fusion with learnable embeddings that discover complementary defect patterns invisible to individual sensors. We incorporate uncertainty quantification through Monte Carlo dropout and learned variance estimation, decomposing uncertainty into epistemic and aleatoric components for safety-critical decisions. Experiments on five bridge datasets reveal that on balanced to moderately imbalanced data, our approach substantially outperforms baselines in accuracy and AUC, representing meaningful improvements over single-modal and concatenation-based fusion. Ablation studies demonstrate that cross-modal attention provides critical gains beyond within-modality attention, while multi-head mechanisms achieve improved calibration. Uncertainty quantification reduces calibration error, enabling selective prediction by rejecting uncertain cases. However, under extreme class imbalance, attention mechanisms show vulnerability to majority-class collapse. These findings provide actionable guidance: the attention-based architecture performs well across typical scenarios, while extreme imbalance requires specialized techniques. Our system maintains deployment efficiency, enabling real-time inspection with characterized capabilities and limitations.
https://arxiv.org/abs/2512.20113
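The uncertainty decomposition above (Monte Carlo dropout plus learned variance) is commonly computed as follows: epistemic uncertainty is the variance of the predicted means across stochastic forward passes, and aleatoric uncertainty is the average of the predicted noise variances. A small sketch under that standard formulation; the paper's exact estimator may differ in detail.

```python
import numpy as np

def decompose_uncertainty(mc_means, mc_vars):
    """Decompose predictive uncertainty from T Monte Carlo dropout passes.

    mc_means: (T, N) predicted means per pass; mc_vars: (T, N) predicted
    (aleatoric) variances per pass. Epistemic = spread of the means across
    passes; aleatoric = average predicted noise variance.
    """
    epistemic = mc_means.var(axis=0)
    aleatoric = mc_vars.mean(axis=0)
    return epistemic, aleatoric, epistemic + aleatoric

rng = np.random.default_rng(0)
mc_means = rng.normal(0.7, 0.05, size=(50, 3))  # model disagrees slightly across passes
mc_vars = np.full((50, 3), 0.02)                 # constant learned noise variance
epi, ale, total = decompose_uncertainty(mc_means, mc_vars)
print(np.round(epi, 4), np.round(ale, 4))
```

For selective prediction, one simply rejects samples whose total uncertainty exceeds a threshold and routes them to a human inspector.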
Sustained operation of solar photovoltaic assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multimodal imaging strategies are popular, they introduce logistical and economic barriers for routine farm-level deployment. This work demonstrates that deep learning and classical machine learning may be judiciously combined to achieve robust surface anomaly categorization and severity estimation from planar visible-band imagery alone. We introduce TinyViT, a compact pipeline integrating Transformer-based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer-grade color camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or IR sensors, our method enables affordable, scalable upkeep for resource-limited installations and advances the state of solar health monitoring toward universal field accessibility. Experiments on real-world public datasets validate both the classification and regression sub-modules, achieving accuracy and interpretability competitive with specialized approaches.
https://arxiv.org/abs/2512.00117
Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoints (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
https://arxiv.org/abs/2511.12196
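The first-phase contrastive objective above can be illustrated with an InfoNCE-style loss, in which a clip's embedding from one camera view is pulled toward the same clip's embedding from another view and pushed away from other clips in the batch. A NumPy sketch of that generic loss; the paper's exact loss and temperature are not given in this abstract.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Cross-view InfoNCE: anchors[i] (view A) should match positives[i]
    (same clip, view B) against all other clips in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives sit on the diagonal

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))  # near-identical views
random_ = info_nce(z, rng.normal(size=(8, 16)))             # unrelated "views"
print(aligned, random_)
```

Minimizing this loss over paired camera views is what drives the features toward view invariance while keeping them action-discriminative.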
Multimodal recommender systems enhance personalized recommendations in e-commerce and online advertising by integrating visual, textual, and user-item interaction data. However, existing methods often overlook two critical biases: (i) modal confounding, where latent factors (e.g., brand style or product category) simultaneously drive multiple modalities and influence user preference, leading to spurious feature-preference associations; (ii) interaction bias, where genuine user preferences are mixed with noise from exposure effects and accidental clicks. To address these challenges, we propose a Causal-inspired multimodal Recommendation framework. Specifically, we introduce a dual-channel cross-modal diffusion module to identify hidden modal confounders, utilize back-door adjustment with hierarchical matching and vector-quantized codebooks to block confounding paths, and apply front-door adjustment combined with causal topology reconstruction to build a deconfounded causal subgraph. Extensive experiments on three real-world e-commerce datasets demonstrate that our method significantly outperforms state-of-the-art baselines while maintaining strong interpretability.
https://arxiv.org/abs/2510.12325
Profiling gamers provides critical insights for adaptive game design, behavioral understanding, and digital well-being. This study proposes an integrated, data-driven framework that combines psychological measures, behavioral analytics, and machine learning to reveal underlying gamer personas. A structured survey of 250 participants, including 113 active gamers, captured multidimensional behavioral, motivational, and social data. The analysis pipeline integrated feature engineering, association networks, knowledge-graph analysis, and unsupervised clustering to extract meaningful patterns. Correlation statistics (Cramér's V, Tschuprow's T, Theil's U, and Spearman's rank correlation) quantified feature associations, and network centrality guided feature selection. Dimensionality-reduction techniques such as PCA, SVD, and t-SNE were coupled with clustering algorithms (K-Means, Agglomerative, Spectral, DBSCAN) and evaluated using the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices. The PCA + K-Means model with k = 4 achieved optimal cluster quality (Silhouette = 0.4), identifying four archetypes: Immersive Social Story-Seekers, Disciplined Optimizers, Strategic Systems Navigators, and Competitive Team-Builders. This research contributes a reproducible pipeline that links correlation-driven network insights with unsupervised learning. The integration of behavioral correlation networks with clustering not only enhances classification accuracy but also offers a holistic lens connecting gameplay motivations with psychological and wellness outcomes.
https://arxiv.org/abs/2510.10263
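Among the association measures listed above, Cramér's V is the chi-squared-based one for pairs of categorical features. A small self-contained implementation, computing the chi-squared statistic directly rather than via SciPy:

```python
import numpy as np

def cramers_v(table):
    """Cramér's V association measure from a contingency table.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))); 0 = independence,
    1 = perfect association.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Perfect association: each row maps to exactly one column -> V = 1.
print(round(cramers_v([[20, 0], [0, 30]]), 3))
# Independence: identical row profiles -> V = 0.
print(round(cramers_v([[10, 10], [10, 10]]), 3))
```

Pairwise V values between survey features are exactly what such an association network's edge weights can be built from before applying centrality-guided feature selection.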
In recent years, multimodal learning has become essential in robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully leverage the textual modality, relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly under significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multi-Modal Mamba-Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantics-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3-fold increase in pretraining inference speed. In particular, M3ET's accuracy on the core VQA task remains at 0.74, while the model's parameter count is reduced by 0.67. Although performance on the EQA task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
https://arxiv.org/abs/2509.18005
Emotion Recognition in Conversation is widely applicable in call center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call center scenario, the agent's role is not confined to receiving calls but also extends to providing a good customer experience by pacifying the frustration or anger of customers, which can be achieved by the agent maintaining a neutral or positive emotion. As in any conversation, the emotion of one speaker usually depends on the emotion of the other speaker. Hence an agent's positive emotion, accompanied by the right resolution, helps enhance the customer experience and can change an unhappy customer into a happy one. Imparting the right resolution at the right time becomes easier if the agent has insight into the emotion of future utterances. To predict the emotions of future utterances, we propose a novel architecture, Emotion Recognition and Forecasting in Conversation (ERFC). Our proposed ERFC architecture considers multiple modalities, different attributes of emotion, context, and the interdependencies of the speakers' utterances in the conversation. Our extensive experiments on the IEMOCAP dataset demonstrate the feasibility of the proposed ERFC. This approach can provide tremendous business value for applications such as call centers, where customer happiness is of utmost importance.
https://arxiv.org/abs/2509.18175
Recent developments in voice cloning and talking head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets through computationally intensive processes using clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline built around Tortoise text-to-speech, a transformer-based latent diffusion model that performs high-fidelity zero-shot voice cloning given only a few training samples. We pair it with a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution reduces reliance on massive pretraining while supporting the generation of emotionally expressive speech and lip synchronization in noisy and unconstrained scenarios. The modular structure of the pipeline allows easy extension toward future multimodal and text-guided voice modulation, and it can be used in real-world systems.
https://arxiv.org/abs/2509.12831
School dropout is a serious problem in distance learning, where early detection is crucial for effective intervention and student perseverance. Predicting student dropout from available educational data is a widely researched topic in learning analytics. Our partner's distance learning platform highlights the importance of integrating diverse data sources, including socio-demographic data, behavioral data, and sentiment analysis, to accurately predict dropout risk. In this paper, we introduce a novel model that combines sentiment analysis of student comments using the Bidirectional Encoder Representations from Transformers (BERT) model with socio-demographic and behavioral data analyzed through Extreme Gradient Boosting (XGBoost). We fine-tuned BERT on student comments to capture nuanced sentiments, which were then merged with key features selected using feature-importance techniques in XGBoost. Our model was tested on unseen data from the next academic year, achieving an accuracy of 84%, compared to 82% for the baseline model. Additionally, the model demonstrated superior performance on other metrics, such as precision and F1-score. The proposed method could be a vital tool in developing personalized strategies to reduce dropout rates and encourage student perseverance.
https://arxiv.org/abs/2507.10421
Short-video misinformation detection has attracted wide attention in the multimodal domain, aiming to accurately identify misinformation in video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize domain generalization for the short-video misinformation detection task, we offer deep insights into the characteristics of different domains: (1) detection in various domains may rely mainly on different modalities (i.e., focusing primarily on videos or audios); to enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains involving cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each video frame) accumulate in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) we employ cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multimodal learning; (2) we design a diffusion model that adds noise to retain core multimodal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at this https URL.
https://arxiv.org/abs/2507.04061
Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial both for automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: this https URL.
https://arxiv.org/abs/2506.20103
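Evaluating localization against the pixel-level masks above typically uses Intersection-over-Union between predicted and annotated corruption regions. A minimal sketch; the benchmark's exact protocol (thresholds, per-video aggregation) may differ.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary pixel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: define IoU as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union

gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True        # 16-pixel annotated artifact region
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True      # prediction shifted by one pixel

print(mask_iou(pred, gt))  # 9 / 23 ≈ 0.391
```

Averaging this score over a dataset gives a simple localization metric that rewards tight spatial agreement rather than mere frame-level detection.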
Early and accurate detection of brain abnormalities, such as tumors and strokes, is essential for timely intervention and improved patient outcomes. In this study, we present a deep learning-based system capable of identifying both brain tumors and strokes from MRI images, along with their respective stages. We implemented two strategies using convolutional neural networks, MobileNetV2 and ResNet-50, optimized through transfer learning to classify MRI scans into five diagnostic categories. Our dataset, aggregated and augmented from various publicly available MRI sources, was carefully curated to ensure class balance and image diversity. To enhance model generalization and prevent overfitting, we applied dropout layers and extensive data augmentation. The models achieved strong performance, with training accuracy reaching 93% and validation accuracy up to 88%. While ResNet-50 demonstrated slightly better results, MobileNetV2 remains a promising option for real-time diagnosis in low-resource settings due to its lightweight architecture. This research offers a practical AI-driven solution for early brain abnormality detection, with potential for clinical deployment and future enhancement through larger datasets and multimodal inputs.
https://arxiv.org/abs/2506.09161
We introduce DEEVISum (Distilled Early-Exit Vision-language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi-Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% at the cost of a 1.3-point drop in F1. Evaluated on the TVSum dataset, our best model, PaLI Gemma2 3B + MSKD, achieves an F1 score of 61.1, matching the performance of significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
https://arxiv.org/abs/2504.21831
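The Early Exit mechanism above trades accuracy for latency by stopping at the first classifier stage whose confidence clears a threshold. A toy sketch of that control flow; the stages and threshold here are illustrative, not DEEVISum's configuration.

```python
def early_exit_predict(stages, x, threshold=0.9):
    """Run classifier stages in order; return (class, depth) as soon as
    one stage's max class probability exceeds the confidence threshold."""
    for depth, stage in enumerate(stages, start=1):
        probs = stage(x)
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), depth
    # Fall through: no stage was confident enough, use the final stage.
    return probs.index(confidence), depth

# Two toy "stages": the first is unsure, the second is confident.
stages = [lambda x: [0.55, 0.45], lambda x: [0.95, 0.05]]
print(early_exit_predict(stages, None))       # exits at stage 2
print(early_exit_predict(stages, None, 0.5))  # a looser threshold exits at stage 1
```

Easy segments exit early and pay only part of the inference cost, which is where the reported ~21% latency reduction comes from at a modest F1 cost.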
Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we propose a novel architecture named the Transformer-based Encoder-Decoder Network (TED Net), designed to estimate human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer-based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses are used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multimodal dataset for assessing general pose estimation, and a newly collected dataset focused on fall-related scenarios involving 20 participants. Experimental results demonstrate that TED Net outperforms existing approaches in pose estimation and that the DGNN achieves reliable action classification using CSI-based skeletons, with performance comparable to RGB-based systems. Notably, TED Net maintains robust performance across both fall and non-fall cases. These findings highlight the potential of CSI-driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy-preserving alternative to vision-based methods, which may raise concerns about continuous camera monitoring.
https://arxiv.org/abs/2504.16655
Background: This study proposes a Vision-Language Model (VLM) leveraging the SIGLIP encoder and Gemma-3b transformer decoder to enhance automated chronic tuberculosis (TB) screening. By integrating chest X-ray images with clinical data, the model addresses the challenges of manual interpretation, improving diagnostic consistency and accessibility, particularly in resource-constrained settings. Methods: The VLM architecture combines a Vision Transformer (ViT) for visual encoding with a transformer-based text encoder that processes clinical context, such as patient histories and treatment records. Cross-modal attention mechanisms align radiographic features with textual information, while the Gemma-3b decoder generates comprehensive diagnostic reports. The model was pre-trained on 5 million paired medical images and texts and fine-tuned on 100,000 chronic TB-specific chest X-rays. Results: The model demonstrated high precision (94%) and recall (94%) for detecting key chronic TB pathologies, including fibrosis, calcified granulomas, and bronchiectasis. Area Under the Curve (AUC) scores exceeded 0.93, and Intersection over Union (IoU) values were above 0.91, validating its effectiveness in detecting and localizing TB-related abnormalities. Conclusion: The VLM offers a robust and scalable solution for automated chronic TB diagnosis, integrating radiographic and clinical data to deliver actionable, context-aware insights. Future work will address subtle pathologies and dataset biases to improve the model's generalizability, ensuring equitable performance across diverse populations and healthcare settings.
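The Intersection over Union (IoU) localization metric reported above has a simple closed form for axis-aligned boxes: intersection area divided by the union of the two areas. A minimal sketch (generic metric code, not tied to this paper's evaluation pipeline):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

For the example pair, the 1x1 overlap against a union of 7 gives IoU = 1/7; a value above 0.91, as reported here, means predicted and reference regions coincide almost exactly.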
https://arxiv.org/abs/2503.14536
3D Gaussian Splatting (3DGS) has been widely used in 3D reconstruction and 3D generation. Training a 3DGS scene often consumes substantial time and resources, and sometimes valuable creative inspiration as well. The growing number of 3DGS digital assets has created serious challenges for copyright protection, yet watermarking tailored to 3DGS remains largely unexplored. In this paper, we propose a new framework, X-SG$^2$S, which can simultaneously watermark 1D to 3D messages while keeping the original 3DGS scene almost unchanged. In general, the framework consists of an X-SG$^2$S injector for adding multi-modal messages simultaneously and an extractor for recovering them. Specifically, we first split the watermarks into message patches in a fixed manner and sort the 3DGS points. A self-adaptive gate then selects suitable locations for watermarking, and XD (multi-dimensional) injection heads embed the multi-modal messages into the sorted 3DGS points. A learnable gate recognizes the locations carrying extra messages, and XD extraction heads restore the hidden messages from the locations the learnable gate recommends. Extensive experiments demonstrate that X-SG$^2$S effectively conceals multi-modal messages without changing the pretrained 3DGS pipeline or the original form of the 3DGS parameters. With a simple, efficient model structure and high practicality, X-SG$^2$S also performs well at hiding and extracting multi-modal structured or unstructured messages. X-SG$^2$S is the first unified 1D-to-3D watermarking model for 3DGS and the first framework to embed multi-modal watermarks simultaneously in a single 3DGS scene, paving the way for later research.
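The preprocessing steps named in the abstract (splitting the watermark into fixed-size message patches and sorting the 3DGS points so injector and extractor agree on locations) can be sketched as follows. This is a toy illustration under assumed details: the patch length, the lexicographic sort key, and the one-patch-per-point carrier assignment are placeholders, and the learned gates and injection heads are omitted:

```python
import numpy as np

def split_into_patches(bits, patch_len):
    """Split a watermark bit sequence into fixed-length message patches,
    zero-padding the last patch (mirrors the fixed splitting step)."""
    pad = (-len(bits)) % patch_len
    bits = list(bits) + [0] * pad
    return [bits[i:i + patch_len] for i in range(0, len(bits), patch_len)]

rng = np.random.default_rng(2)
points = rng.standard_normal((100, 3))  # toy stand-in for 3DGS centers

# Deterministic ordering: sort by x, then y, then z (lexsort's primary
# key is the LAST one passed), so both sides derive the same ordering.
order = np.lexsort((points[:, 2], points[:, 1], points[:, 0]))

bits = rng.integers(0, 2, size=13)      # 13-bit toy watermark
patches = split_into_patches(bits, patch_len=4)

# Hypothetical gate: assign each patch to one sorted point as carrier.
carriers = order[: len(patches)]
```

Because the sort is a pure function of the point attributes, the extractor can recompute `order` from the watermarked scene alone and look up the same carrier locations without any side channel.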
https://arxiv.org/abs/2502.10475