Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable as needed. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multi-modal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on a diffusion model. Through extensive experiments, we respectively verify the effectiveness of TLCG and CLTD, and demonstrate the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks such as text detection and text recognition. Code and datasets will be made available.
https://arxiv.org/abs/2501.02962
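To make the two-stage flow above concrete, here is a minimal Python sketch of the interface between a layout/content stage and a rendering stage. TextRegion, tlcg_propose, and cltd_render are invented names and the bodies are stubs, not the released SceneVTG++ code.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextRegion:
    bbox: Tuple[int, int, int, int]          # (x, y, w, h) of a proposed carrier area
    content: str                             # scene-relevant text recommended for that area
    font: str = "default"                    # controllable attributes
    color: Tuple[int, int, int] = (0, 0, 0)

def tlcg_propose(background_description: str) -> List[TextRegion]:
    # Stage 1 (TLCG): a multi-modal LLM would analyze the background image and
    # return plausible carrier regions plus text content; fixed example here.
    return [TextRegion(bbox=(120, 40, 200, 60), content="OPEN 24 HOURS")]

def cltd_render(background, regions: List[TextRegion]):
    # Stage 2 (CLTD): a conditional diffusion model would in-paint each region
    # with the requested text, font and color; stubbed as a print statement.
    for r in regions:
        print(f"render '{r.content}' at {r.bbox}, font={r.font}, color={r.color}")
    return background

regions = tlcg_propose("storefront on a city street at dusk")
cltd_render(background=None, regions=regions)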
Autonomous off-road navigation is required for applications in agriculture, construction, search and rescue, and defence. Traditional on-road autonomous methods struggle with dynamic terrain, leading to poor vehicle control off-road. Recent deep-learning models have used perception sensors along with kinesthetic feedback for navigation on such terrain. However, this approach suffers from out-of-domain uncertainty: factors such as changes in weather and time of day degrade the performance of the model. We propose a multi-modal fusion network, FuseIsPath, capable of using LWIR and RGB images to provide robustness against dynamic weather and lighting conditions. To aid further work in this domain, we also open-source a day-night dataset with LWIR and RGB images along with pseudo-labels for traversability. To co-register the two images, we developed a novel method for targetless extrinsic calibration of the LWIR camera, LiDAR, and RGB camera with a translation accuracy of 1.7 cm and a rotation accuracy of 0.827 degrees.
https://arxiv.org/abs/2412.03173
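A rough PyTorch sketch of the kind of mid-level LWIR/RGB fusion described above: two small encoders whose features are concatenated before a per-pixel traversability head. Layer sizes are illustrative and this is not the FuseIsPath architecture.

import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class FusionSegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_enc = SmallEncoder(3)     # RGB branch
        self.lwir_enc = SmallEncoder(1)    # long-wave infrared branch
        self.head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),           # per-pixel traversability logit
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
    def forward(self, rgb, lwir):
        fused = torch.cat([self.rgb_enc(rgb), self.lwir_enc(lwir)], dim=1)  # channel concat
        return self.head(fused)

model = FusionSegmenter()
out = model(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
print(out.shape)   # torch.Size([1, 1, 128, 128])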
The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand placed on healthcare systems around the world by an aging demographic and a shortage of healthcare professionals. Even though LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we investigate the groundbreaking developments in robotics and LLMs to identify the system requirements needed for designing health-specific, LLM-based robots in terms of multi-modal communication through human-robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging and innovative field.
https://arxiv.org/abs/2411.03287
Understanding animal vocalizations through multi-source data fusion is crucial for assessing emotional states and enhancing animal welfare in precision livestock farming. This study aims to decode dairy cow contact calls by employing multi-modal data fusion techniques, integrating transcription, semantic analysis, contextual and emotional assessment, and acoustic feature extraction. We utilized a natural language processing model to transcribe audio recordings of cow vocalizations into written form. By fusing multiple acoustic features (frequency, duration, and intensity) with the transcribed textual data, we developed a comprehensive representation of cow vocalizations. Utilizing data fusion within a custom-developed ontology, we categorized vocalizations into high-frequency calls associated with distress or arousal and low-frequency calls linked to contentment or calmness. Analyzing the fused multi-dimensional data, we identified anxiety-related features indicative of emotional distress, including specific frequency measurements and sound spectrum results. Assessing the sentiment and acoustic features of vocalizations from 20 individual cows allowed us to determine differences in calling patterns and emotional states. Employing advanced machine learning algorithms (Random Forest, Support Vector Machine, and Recurrent Neural Networks), we effectively processed and fused multi-source data to classify cow vocalizations. These models were optimized to handle the computational demands and data quality challenges inherent in practical farm environments. Our findings demonstrate the effectiveness of multi-source data fusion and intelligent processing techniques in animal welfare monitoring. This study represents a significant advancement in animal welfare assessment, highlighting the role of innovative fusion technologies in understanding and improving the emotional wellbeing of dairy cows.
https://arxiv.org/abs/2411.00477
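A toy example of the early-fusion step: acoustic measurements (frequency, duration, intensity) are concatenated with a text representation of the transcripts and fed to a random forest. The data are synthetic and TF-IDF stands in for whatever text features the study actually used.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

transcripts = ["short urgent call", "long calm call", "repeated urgent call", "soft low call"]
acoustic = np.array([[820.0, 0.4, 72.0],    # [fundamental frequency (Hz), duration (s), intensity (dB)]
                     [310.0, 1.1, 55.0],
                     [790.0, 0.5, 70.0],
                     [290.0, 1.3, 52.0]])
labels = [1, 0, 1, 0]                       # 1 = distress/arousal, 0 = contentment/calm

text_features = TfidfVectorizer().fit_transform(transcripts).toarray()
fused = np.hstack([acoustic, text_features])   # simple early fusion of both modalities

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(fused, labels)
print(clf.predict(fused[:1]))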
This paper introduces M2M Gen, a multi-modal framework for generating background music tailored to Japanese manga. The key challenge in this task is the lack of an available dataset or baseline. To address this challenge, we propose an automated music generation pipeline that produces background music for an input manga book. Initially, we use the dialogues in a manga to detect scene boundaries and perform emotion classification using the characters' faces within a scene. Then, we use GPT-4o to translate this low-level scene information into a high-level music directive. Conditioned on the scene information and the music directive, another instance of GPT-4o generates page-level music captions to guide a text-to-music model. This produces music that is aligned with the manga's evolving narrative. The effectiveness of M2M Gen is confirmed through extensive subjective evaluations, showcasing its capability to generate higher-quality, more relevant, and more consistent music that complements specific scenes when compared to our baselines.
https://arxiv.org/abs/2410.09928
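A schematic of the pipeline stages described above, with every model call replaced by a stub (scene-boundary detection, face emotion classification, the GPT-4o directive and caption steps, and the text-to-music model). Function names are hypothetical.

from typing import List

def detect_scene_boundaries(dialogues: List[str]) -> List[List[str]]:
    return [dialogues[:2], dialogues[2:]]       # pretend the book splits into two scenes

def classify_face_emotions(scene: List[str]) -> str:
    return "tense"                              # dominant emotion of the characters' faces

def scene_to_music_directive(emotion: str) -> str:
    # In the paper this translation is done by GPT-4o; here it is a template.
    return f"Slow, suspenseful strings reflecting a {emotion} scene."

def page_level_caption(emotion: str, directive: str) -> str:
    # A second GPT-4o call in the paper; again reduced to a template.
    return f"{directive} Build gradually as the {emotion} moment peaks."

def text_to_music(caption: str) -> bytes:
    return b"<audio bytes>"                     # stand-in for a text-to-music model

dialogues = ["We have to go.", "Now!", "It's quiet...", "Too quiet."]
for scene in detect_scene_boundaries(dialogues):
    emotion = classify_face_emotions(scene)
    directive = scene_to_music_directive(emotion)
    audio = text_to_music(page_level_caption(emotion, directive))
    print(directive, len(audio))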
Emoticons are symbolic representations that generally accompany textual content to visually enhance or summarize the true intention of a written message. Although widely utilized in the realm of social media, the core semantics of these emoticons have not been extensively explored from the perspective of multiple modalities. Incorporating textual and visual information within a single message provides an advanced way of conveying information. Hence, this research aims to analyze the relationship among sentences, visuals, and emoticons. For an orderly exposition, this paper first provides a detailed examination of the various techniques for extracting multimodal features, emphasizing the pros and cons of each method. After a comprehensive examination of several multimodal algorithms, with specific emphasis on fusion approaches, we propose a novel contrastive-learning-based multimodal architecture. The proposed model employs joint training of a dual-branch encoder along with contrastive learning to accurately map text and images into a common latent space. Our key finding is that integrating the principle of contrastive learning with the other two branches yields superior results. The experimental results demonstrate that our suggested methodology surpasses existing multimodal approaches in terms of accuracy and robustness. The proposed model attained an accuracy of 91% and an MCC score of 90% when assessing emoticons using the Multimodal-Twitter Emoticon dataset acquired from Twitter. We provide evidence that deep features acquired by contrastive learning are more efficient, suggesting that the proposed fusion technique also possesses strong generalisation capabilities for recognising emoticons across several modalities.
https://arxiv.org/abs/2408.02571
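A minimal sketch of a dual-branch encoder trained with a symmetric contrastive (CLIP-style) loss to map text and image features into a shared latent space. Dimensions, encoders, and the temperature value are placeholders rather than the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm embeddings

text_enc, image_enc = Branch(300), Branch(512)
text_feats = torch.randn(8, 300)      # e.g., precomputed sentence embeddings
image_feats = torch.randn(8, 512)     # e.g., pooled CNN features

t, v = text_enc(text_feats), image_enc(image_feats)
logits = t @ v.T / 0.07               # cosine similarities scaled by a temperature
targets = torch.arange(len(t))        # matched text-image pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
print(float(loss))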
This paper introduces a groundbreaking multi-modal neural network model designed for resolution enhancement, which innovatively leverages inter-diagnostic correlations within a system. Traditional approaches have primarily focused on uni-modal enhancement strategies, such as pixel-based image enhancement or heuristic signal interpolation. In contrast, our model employs a novel methodology by harnessing the diagnostic relationships within the physics of fusion plasma. Initially, we establish the correlation among diagnostics within the tokamak. Subsequently, we utilize these correlations to substantially enhance the temporal resolution of the Thomson Scattering diagnostic, which assesses plasma density and temperature. By increasing its resolution from the conventional 200 Hz to 500 kHz, we facilitate a new level of insight into plasma behavior, previously attainable only through computationally intensive simulations. This enhancement goes beyond simple interpolation, offering novel perspectives on the underlying physical phenomena governing plasma dynamics.
https://arxiv.org/abs/2405.05908
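A toy illustration of the cross-diagnostic idea on synthetic data: a regressor learns the mapping from a fast-sampled, correlated signal to the sparsely sampled quantity, and the learned mapping is then evaluated on the fast time base. This is only a conceptual analogue of the paper's neural model.

import numpy as np
from sklearn.neural_network import MLPRegressor

t_fast = np.linspace(0, 1, 5000)                  # dense time base of the fast diagnostic
fast_signal = np.sin(2 * np.pi * 5 * t_fast)      # correlated, fast-sampled signal
true_te = 1.0 + 0.3 * fast_signal                 # quantity we wish to resolve in time

te_sparse = true_te[::250]                        # sparse "Thomson-like" samples

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
model.fit(fast_signal[::250].reshape(-1, 1), te_sparse)
te_upsampled = model.predict(fast_signal.reshape(-1, 1))
print(np.abs(te_upsampled - true_te).max())       # reconstruction error on this toy problem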
This report provides a detailed description of the method that we explored and proposed in the WECIA Emotion Prediction Competition (EPC), which predicts a person's emotion from an artistic work together with a comment. The dataset of this competition is ArtELingo, designed to encourage work on diversity across languages and cultures. The dataset poses two main challenges, namely the modal imbalance problem and the language-cultural differences problem. To address these issues, we propose a simple yet effective approach called single-multi modal with Emotion-Cultural specific prompt (ECSP), which focuses on using the single-modal message to enhance the performance of multimodal models and on a well-designed prompt to reduce the cultural differences problem. To clarify, our approach contains two main blocks: (1) an XLM-R\cite{conneau2019unsupervised} based unimodal model and an X$^2$-VLM\cite{zeng2022x} based multimodal model, and (2) an Emotion-Cultural specific prompt. Our approach ranked first in the final test with a score of 0.627.
https://arxiv.org/abs/2403.17683
In the rapidly evolving field of machine learning (ML), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of Large Language Models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From a data perspective and a learning perspective, we examine various strategies that utilize Large Language Models for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for further training. Additionally, this paper delineates the primary challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights the paradigm shift introduced by LLMs in DA and aims to serve as a foundational guide for researchers and practitioners in this field.
https://arxiv.org/abs/2403.02990
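A minimal, label-preserving augmentation loop of the kind LLMs enable: each training sentence is expanded with paraphrases. llm_paraphrase is a placeholder for an actual LLM call, not a real API.

from typing import Callable, List, Tuple

def llm_paraphrase(text: str) -> List[str]:
    # Stand-in: a real implementation would prompt an LLM, e.g.
    # "Rewrite the sentence in two different ways, keeping the meaning."
    return [text + " (variant 1)", text + " (variant 2)"]

def augment(dataset: List[Tuple[str, int]],
            paraphrase: Callable[[str], List[str]] = llm_paraphrase) -> List[Tuple[str, int]]:
    augmented = list(dataset)
    for text, label in dataset:
        augmented.extend((p, label) for p in paraphrase(text))   # label-preserving expansion
    return augmented

print(augment([("the movie was great", 1), ("terrible plot", 0)]))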
This paper presents a novel multi-modal deep learning framework for enhanced agricultural pest detection, combining tiny-BERT's natural language processing with R-CNN and ResNet-18's image processing. Addressing the limitations of traditional CNN-based visual methods, this approach integrates textual context for more accurate pest identification. The R-CNN and ResNet-18 integration tackles deep-CNN issues such as vanishing gradients, while tiny-BERT ensures computational efficiency. Employing ensemble learning with linear regression and random forest models, the framework demonstrates superior discriminative ability, as shown in ROC and AUC analyses. This multi-modal approach, blending text and image data, significantly boosts pest detection in agriculture. The study highlights the potential of multi-modal deep learning in complex real-world scenarios, suggesting future expansions in dataset diversity, advanced data augmentation, and cross-modal attention mechanisms to enhance model performance.
https://arxiv.org/abs/2312.10948
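An illustrative late-fusion ensemble on synthetic scores: per-modality confidences (stand-ins for the tiny-BERT text branch and the R-CNN/ResNet-18 image branch) are blended by linear-regression and random-forest meta-models and evaluated with AUC. Evaluation here is on the training data, for illustration only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                      # pest present / absent
text_score = 0.6 * y + rng.normal(0.2, 0.2, 200)      # noisy text-branch confidence
image_score = 0.5 * y + rng.normal(0.25, 0.25, 200)   # noisy image-branch confidence
X = np.column_stack([text_score, image_score])

lin = LinearRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
blend = 0.5 * lin.predict(X) + 0.5 * rf.predict_proba(X)[:, 1]
print("AUC:", round(roc_auc_score(y, blend), 3))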
Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be used interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult due to the inherent challenges in accurately interpreting and integrating varied data sources. It is also a challenge to robustly handle missing or partial information while allowing a direct switch between regression and classification tasks. This study proposes a \emph{versatile audio-visual learning} (VAVL) framework for handling unimodal and multimodal systems for emotion regression and emotion classification tasks. We implement an audio-visual framework that can be trained even when paired audio-visual data is not available for part of the training set (i.e., only audio or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on both the CREMA-D and MSP-IMPROV corpora. Notably, VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus. Code available at: this https URL
https://arxiv.org/abs/2305.07216
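A schematic of the ingredients named above: modality encoders feeding shared layers, a residual connection over the shared layers, and a unimodal reconstruction head, with a forward pass that tolerates a missing modality. Dimensions are arbitrary and this is not the released VAVL code.

import torch
import torch.nn as nn

class VAVLSketch(nn.Module):
    def __init__(self, a_dim=40, v_dim=512, h=128):
        super().__init__()
        self.audio_enc = nn.Linear(a_dim, h)
        self.visual_enc = nn.Linear(v_dim, h)
        self.shared = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))
        self.emo_head = nn.Linear(h, 3)       # e.g., arousal / valence / dominance
        self.recon_head = nn.Linear(h, h)     # target of the unimodal reconstruction loss

    def forward(self, audio=None, visual=None):
        feats = []
        if audio is not None:
            feats.append(torch.relu(self.audio_enc(audio)))
        if visual is not None:
            feats.append(torch.relu(self.visual_enc(visual)))
        x = torch.stack(feats).mean(0)        # works with one modality or both
        x = x + self.shared(x)                # residual connection over the shared layers
        return self.emo_head(x), self.recon_head(x)

model = VAVLSketch()
pred, recon = model(audio=torch.randn(4, 40))  # an audio-only batch still runs
print(pred.shape, recon.shape)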
In the past ten years, with the help of deep learning, especially the rapid development of deep neural networks, medical image analysis has made remarkable progress. However, how to effectively use the relational information between various tissues or organs in medical images is still a very challenging problem, and it has not been fully studied. In this thesis, we propose two novel solutions to this problem based on deep relational learning. First, we propose a context-aware fully convolutional network that effectively models implicit relational information between features to perform medical image segmentation. The network achieves state-of-the-art segmentation results on the Multimodal Brain Tumor Segmentation 2017 (BraTS2017) and Multimodal Brain Tumor Segmentation 2018 (BraTS2018) datasets. Subsequently, we propose a new hierarchical homography estimation network to achieve accurate medical image mosaicing by learning the explicit spatial relationship between adjacent frames. We conduct experiments on the UCL Fetoscopy Placenta dataset, and our hierarchical homography estimation network outperforms other state-of-the-art mosaicing methods while generating robust and meaningful mosaicing results on unseen frames.
https://arxiv.org/abs/2303.16099
Falls have become more frequent in recent years, which is harmful to senior citizens. Therefore, detecting falls has become important, and several datasets and machine learning models related to fall detection have been introduced. In this project report, a human fall detection method is proposed using a multi-modality approach. We used the UP-FALL detection dataset, which was collected from dozens of volunteers using different sensors and two cameras. We use the wrist sensor's accelerometer data and restrict the labels from the dataset to a binary classification, namely fall and no fall. We used a fusion of camera and sensor data to increase performance. The experimental results show that, for binary classification, using only wrist data as compared to multiple sensors did not impact the model's prediction performance for fall detection.
https://arxiv.org/abs/2302.00224
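A toy version of the wrist-only setting: simple statistics over synthetic 3-axis accelerometer windows feed a binary fall / no-fall classifier. The data generator and features are invented and only illustrate the approach.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def make_window(is_fall: int) -> np.ndarray:
    accel = rng.normal(0, 0.2, (100, 3)) + np.array([0.0, 0.0, 9.8])  # 3-axis wrist accel
    if is_fall:
        accel[40:50] += rng.normal(0, 6.0, (10, 3))                   # impact spike
    return accel

def window_features(w: np.ndarray) -> list:
    mag = np.linalg.norm(w, axis=1)
    return [mag.mean(), mag.std(), mag.max(), mag.min()]

X, y = [], []
for i in range(200):
    label = i % 2                            # alternate fall / no-fall windows
    X.append(window_features(make_window(label)))
    y.append(label)

Xtr, Xte, ytr, yte = train_test_split(np.array(X), np.array(y), random_state=0)
clf = SVC().fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))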
Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis text can be inferred from a given premise image; it is one special task among recently emerged vision-and-language understanding tasks. Currently, most existing VE approaches are derived from methods for visual question answering. They recognize visual entailment by quantifying the similarity between the hypothesis and the premise in the content semantic features from multiple modalities. Such approaches, however, ignore VE's unique nature of relation inference between the premise and hypothesis. Therefore, in this paper, a new architecture called AlignVE is proposed to solve the visual entailment problem with a relation interaction method. It models the relation between the premise and hypothesis as an alignment matrix. It then introduces a pooling operation to obtain feature vectors of a fixed size. Finally, it goes through a fully-connected layer and a normalization layer to complete the classification. Experiments show that our alignment-based architecture reaches 72.45\% accuracy on the SNLI-VE dataset, outperforming previous content-based models under the same settings.
https://arxiv.org/abs/2211.08736
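A sketch of the alignment-matrix idea: premise region features and hypothesis token features are related by a cosine-similarity matrix, pooled to a fixed size, and classified by a fully-connected layer after normalization. Feature sizes are illustrative and random tensors stand in for the encoders.

import torch
import torch.nn as nn
import torch.nn.functional as F

regions = torch.randn(1, 36, 256)    # premise image region features (e.g., detector output)
tokens = torch.randn(1, 20, 256)     # hypothesis token embeddings

align = torch.einsum("brd,btd->brt",
                     F.normalize(regions, dim=-1),
                     F.normalize(tokens, dim=-1))           # region-token alignment matrix
pooled = F.adaptive_avg_pool2d(align.unsqueeze(1), (8, 8))  # fixed-size relation summary
pooled = pooled.flatten(1)

classifier = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 3))  # entail / neutral / contradict
print(classifier(pooled).shape)      # torch.Size([1, 3])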
Food is not only a basic human necessity but also a key factor driving a society's health and economic well-being. As a result, the cooking domain is a popular use case for demonstrating decision-support (AI) capabilities in service of benefits like precision health, with tools ranging from information retrieval interfaces to task-oriented chatbots. An AI here should understand concepts in the food domain (e.g., recipes, ingredients), be tolerant of failures encountered while cooking (e.g., browning of butter), handle allergy-based substitutions, and work with multiple data modalities (e.g., text and images). However, recipes today are handled as textual documents, which makes it difficult for machines to read them, reason over them, and handle ambiguity. This creates a need for a better representation of recipes, overcoming the ambiguity and sparseness that exist in the current textual documents. In this paper, we discuss the construction of a machine-understandable rich recipe representation (R3), in the form of plans, from recipes available in natural language. R3 is infused with additional knowledge such as information about allergens and images of ingredients, possible failures, and tips for each atomic cooking step. To show the benefits of R3, we also present TREAT, a tool for recipe retrieval which uses R3 to perform multi-modal reasoning over the recipe's content (plan objects: ingredients and cooking tools), food preparation process (plan actions and time), and media type (image, text). R3 leads to improved retrieval efficiency and new capabilities that were hitherto not possible with a textual representation.
https://arxiv.org/abs/2203.17109
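A small illustration of what a plan-style recipe representation can look like: typed steps carrying ingredients, tools, durations, possible failures, and tips, plus recipe-level allergen information. The field names are invented for this sketch, not the R3 schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str
    ingredients: List[str]
    tools: List[str]
    duration_min: float
    possible_failure: str = ""
    tip: str = ""

@dataclass
class Recipe:
    name: str
    allergens: List[str]
    steps: List[Step] = field(default_factory=list)

recipe = Recipe(
    name="Brown butter sauce",
    allergens=["dairy"],
    steps=[Step("melt", ["butter"], ["saucepan"], 2.0,
                possible_failure="butter burns", tip="keep the heat at medium")],
)
print(recipe.steps[0].possible_failure)   # failure knowledge attached to an atomic step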
Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human-populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially compliant navigation behaviors on these robots becomes critical to ensuring safe and comfortable human-robot coexistence. To address this challenge, imitation learning is a promising framework, since it is easier for humans to demonstrate the task of social navigation than to formulate reward functions that accurately capture the complex multi-objective setting of social navigation. The use of imitation learning and inverse reinforcement learning for social navigation on mobile robots, however, is currently hindered by a lack of large-scale datasets that capture socially compliant robot navigation demonstrations in the wild. To fill this gap, we introduce the Socially CompliAnt Navigation Dataset (SCAND), a large-scale, first-person-view dataset of socially compliant navigation demonstrations. Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations, comprising multi-modal data streams including 3D lidar, joystick commands, odometry, and visual and inertial information, collected on two morphologically different mobile robots (a Boston Dynamics Spot and a Clearpath Jackal) by four different human demonstrators in both indoor and outdoor environments. We additionally perform preliminary analysis and validation through real-world robot experiments and show that navigation policies learned by imitation learning on SCAND generate socially compliant behaviors.
https://arxiv.org/abs/2203.15041
In this paper, we propose to build a stylish image captioning model through a Multi-style Multi-modality mechanism (2M). We demonstrate that with 2M, we can build an effective stylish captioner, and that the multi-references produced by the model can also support explaining the model by identifying erroneous input features on faulty examples. We show how this 2M mechanism can be used to build stylish captioning models and how these models can be utilized to provide explanations of likely errors in the models.
https://arxiv.org/abs/2110.10704
With the increase in computational power and the development of new state-of-the-art deep learning algorithms, appearance-based gaze estimation is becoming more and more popular. It is believed to work well with curated laboratory datasets; however, it faces several challenges when deployed in real-world scenarios. One such challenge is estimating the gaze of a person about whom the deep learning model trained for gaze estimation has no knowledge. To analyse the performance in such scenarios, we have tried to simulate a calibration mechanism. In this work we use the MPIIGaze dataset. We trained a multi-modal convolutional neural network and analysed its performance with and without calibration, and this evaluation provides clear insights into how calibration improved the performance of the deep learning model in estimating gaze in the wild.
https://arxiv.org/abs/2109.12801
Sepsis is a life-threatening disease with high morbidity, mortality, and healthcare costs. The early prediction and administration of antibiotics and intravenous fluids is considered crucial for the treatment of sepsis and can save potentially millions of lives and billions in healthcare costs. Professional clinical care practitioners have proposed clinical criteria which aid in the early detection of sepsis; however, the performance of these criteria is often limited. Clinical text provides essential information for estimating the severity of sepsis in addition to structured clinical data. In this study, we explore how clinical text can complement structured data for the early sepsis prediction task. We propose a multi-modal model which incorporates both structured data, in the form of patient measurements, and textual notes on the patient. We employ state-of-the-art NLP models such as BERT and a highly specialized NLP model, Amazon Comprehend Medical, to represent the text. On the MIMIC-III dataset containing records of ICU admissions, we show that by using these notes, one achieves an improvement of 6.07 points in a standard utility score for sepsis prediction and 2.89% in AUROC score. Our method significantly outperforms a clinical criterion suggested by experts, qSOFA, as well as the winning model of the PhysioNet Computing in Cardiology Challenge for predicting sepsis.
https://arxiv.org/abs/2107.11094
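An illustrative early fusion of structured measurements with note representations for a sepsis-style classifier; TF-IDF stands in for the BERT / Comprehend Medical embeddings and all records below are synthetic.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = ["patient febrile, suspected infection", "stable, no acute distress",
         "hypotensive, rising lactate", "routine post-op check"]
vitals = np.array([[38.9, 118, 88],      # [temperature (C), heart rate, systolic BP]
                   [36.8, 72, 120],
                   [39.2, 130, 80],
                   [37.0, 80, 118]])
labels = [1, 0, 1, 0]                    # sepsis within the prediction horizon (synthetic)

text_features = TfidfVectorizer().fit_transform(notes).toarray()
X = np.hstack([vitals, text_features])   # early fusion of structured data and note features
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1].round(2))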
Emotion recognition is an important research field for Human-Computer Interaction (HCI). Audio-Video Emotion Recognition (AVER) is now tackled with Deep Neural Network (DNN) modeling tools. In published papers, as a rule, the authors show only cases where multiple modalities are superior to audio-only or video-only modalities. However, there are cases where a single modality can be superior. In our research, we hypothesize that for fuzzy categories of emotional events, the higher noise of one modality can amplify the lower noise of the second modality, as represented indirectly in the parameters of the modeling neural network. To avoid such cross-modal information interference, we define a multi-modal Residual Perceptron Network (MRPN), which learns from multi-modal network branches, creating a deep feature representation with reduced noise. For the proposed MRPN model and the novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate was improved to 91.4% for the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset and to 83.15% for the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Moreover, the MRPN concept shows its potential for multi-modal classifiers dealing with signal sources not only of optical and acoustical type.
https://arxiv.org/abs/2107.10742
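A rough interpretation of combining per-modality branch features through a perceptron with a residual connection; this is a reading of the idea, not the authors' MRPN implementation.

import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, a_dim=128, v_dim=128, n_classes=8):
        super().__init__()
        d = a_dim + v_dim
        self.mlp = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, d))
        self.head = nn.Linear(d, n_classes)

    def forward(self, audio_feat, video_feat):
        x = torch.cat([audio_feat, video_feat], dim=-1)   # branch features from each modality
        x = x + self.mlp(x)                               # residual path over the fusion perceptron
        return self.head(torch.relu(x))

model = ResidualFusion()
logits = model(torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)   # torch.Size([4, 8])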