Semantic understanding of scenes in three-dimensional (3D) space is a quintessential part of robotics-oriented applications such as autonomous driving, as it provides geometric cues such as the size, orientation, and true separation distance of objects, which are crucial for mission-critical decisions. As a first step, in this work we investigate the possibility of semantically classifying the different parts of a given scene in 3D by learning the underlying geometric context in addition to texture cues, but in the absence of labelled real-world datasets. To this end, we generate a large number of synthetic scenes, their pixel-wise labels, and corresponding 3D representations using the CARLA software framework. We then build a deep neural network that learns category-specific 3D representations and texture cues from the color information of the rendered synthetic scenes. We subsequently apply the learned model to different real-world datasets to evaluate its performance. Our preliminary results show that the neural network is able to learn geometric context from synthetic scenes and effectively apply this knowledge to classify each point of a 3D representation of a real-world scene.
https://arxiv.org/abs/1910.13676
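The abstract does not spell out the network itself, so the following is only a hedged sketch of the general idea, not the paper's architecture: a minimal PointNet-style classifier that labels every point of a cloud from its geometry (XYZ) plus color (RGB), the two cue types named above. All layer sizes and the class count are illustrative assumptions.

# Hypothetical sketch (not the paper's actual architecture): per-point
# semantic classification from 6-D inputs (x, y, z, r, g, b).
import torch
import torch.nn as nn

class PointwiseSegNet(nn.Module):
    def __init__(self, num_classes: int = 13):  # class count is an assumption
        super().__init__()
        # Shared per-point MLP over the 6-D geometry + color input.
        self.local = nn.Sequential(
            nn.Conv1d(6, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
        )
        # Global scene context, max-pooled over all points.
        self.global_mlp = nn.Sequential(
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
        )
        # Per-point classifier on concatenated [local, global] features.
        self.head = nn.Conv1d(128 + 256, num_classes, 1)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, 6, N) -> logits: (B, num_classes, N)
        local = self.local(points)
        glob = self.global_mlp(local).max(dim=2, keepdim=True).values
        glob = glob.expand(-1, -1, local.size(2))
        return self.head(torch.cat([local, glob], dim=1))

if __name__ == "__main__":
    net = PointwiseSegNet()
    cloud = torch.randn(2, 6, 4096)  # two scenes, 4096 points each
    print(net(cloud).shape)          # torch.Size([2, 13, 4096])

Because the per-point MLP sees geometry and color jointly, the same classifier can in principle be trained on synthetic CARLA clouds and evaluated on real-world clouds, which is the transfer the abstract describes.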
The tactical driver behavior modeling problem requires understanding driver actions in complicated urban scenarios from rich multimodal signals, including video, LiDAR, and CAN bus data streams. However, the majority of deep learning research focuses either on learning the vehicle/environment state (sensor fusion) or the driver policy (from temporal data), but not both. Learning both tasks end-to-end offers the richest distillation of knowledge, but presents challenges in formulation and successful training. In this work, we propose promising first steps in this direction. Inspired by the gating mechanisms in LSTMs, we propose gated recurrent fusion units (GRFU) that learn fusion weighting and temporal weighting simultaneously. We demonstrate their superior performance over multimodal and temporal baselines in supervised regression and classification tasks, all in the realm of autonomous navigation. We note a 10% improvement in mAP over the state of the art for tactical driver behavior classification on the HDD dataset and a 20% drop in overall mean squared error for steering-action regression on the TORCS dataset.
https://arxiv.org/abs/1910.00628
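The exact GRFU equations are not given in the abstract, so the sketch below only illustrates the stated idea under assumptions: a learned sigmoid gate per modality supplies the fusion weighting, and a GRU supplies the temporal weighting. All names and dimensions are hypothetical, not the paper's.

# Hedged sketch of gated multimodal fusion followed by recurrence
# (illustrative only; not the published GRFU cell).
import torch
import torch.nn as nn

class GatedRecurrentFusion(nn.Module):
    def __init__(self, dims, fused_dim=128, hidden_dim=256):
        super().__init__()
        # Project each modality into a shared fused space.
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in dims)
        # One gate per modality, conditioned on all projected modalities.
        self.gates = nn.ModuleList(
            nn.Linear(fused_dim * len(dims), fused_dim) for _ in dims
        )
        self.rnn = nn.GRU(fused_dim, hidden_dim, batch_first=True)

    def forward(self, streams):
        # streams: list of (B, T, dim_i) tensors, one per modality.
        proj = [p(s) for p, s in zip(self.proj, streams)]
        joint = torch.cat(proj, dim=-1)
        # Learned fusion weighting: sigmoid gate scales each modality.
        fused = sum(torch.sigmoid(g(joint)) * z
                    for g, z in zip(self.gates, proj))
        out, _ = self.rnn(fused)  # temporal weighting via recurrence
        return out

if __name__ == "__main__":
    # Hypothetical feature sizes for video, LiDAR, and CAN bus streams.
    grf = GatedRecurrentFusion(dims=[512, 64, 10])
    video, lidar, can = (torch.randn(4, 20, d) for d in (512, 64, 10))
    print(grf([video, lidar, can]).shape)  # torch.Size([4, 20, 256])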
With the development of technology, the usage areas and importance of biometric systems have increased. Since each person's characteristics differ from everyone else's, a single-model biometric system can yield successful results. However, because the characteristics of twins are very close to each other, a multimodal biometric system that includes multiple characteristics of an individual is more appropriate and increases the recognition rate. In this study, a multi-biometric recognition system combining multiple algorithms and multiple models was developed to distinguish people from both other people and their twins. Ear and voice biometric data were used for the multimodal model, and the dataset consisted of ear images and sound recordings from 38 pairs of twins. Sound and ear recognition rates were obtained using classical (hand-crafted) and deep learning algorithms. The results were combined with score-level fusion to achieve a success rate of 94.74% at rank-1 and 100% at rank-2.
https://arxiv.org/abs/1903.07981
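Score-level fusion itself is a standard technique; a minimal sketch follows, with min-max normalization, a weighted sum of the two matchers' scores, and the rank-k accuracy the abstract reports. The weight w and the matrix shapes are illustrative assumptions, not values from the paper.

# Minimal sketch of weighted score-level fusion and rank-k evaluation.
import numpy as np

def fuse_scores(ear_scores, voice_scores, w=0.5):
    """Min-max normalize each matcher's score matrix, then combine with
    a weighted sum (one row per probe, one column per gallery identity)."""
    def norm(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    return w * norm(ear_scores) + (1 - w) * norm(voice_scores)

def rank_k_accuracy(scores, true_ids, k=1):
    """Fraction of probes whose true identity is among the top-k scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    return np.mean([t in row for t, row in zip(true_ids, top_k)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ear = rng.random((38, 76))    # 38 probes vs. 76 gallery identities (assumed)
    voice = rng.random((38, 76))
    fused = fuse_scores(ear, voice, w=0.6)  # w=0.6 is an arbitrary example
    labels = rng.integers(0, 76, size=38)
    print(rank_k_accuracy(fused, labels, k=1),
          rank_k_accuracy(fused, labels, k=2))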
In this work, we propose a multimodal convolutional neural network (CNN) approach for brain tumor segmentation. We investigate how to combine different modalities efficiently within the CNN framework. We adapt various fusion methods, previously employed for video recognition, to the brain tumor segmentation problem, and we investigate their efficiency in terms of memory and performance. Our experiments, performed on the BRATS dataset, lead us to the conclusion that learning separate representations for each modality and combining them for brain tumor segmentation can increase the performance of CNN systems.
https://arxiv.org/abs/1809.06191
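To make the comparison concrete, here is a minimal sketch (not the paper's configuration) contrasting the two fusion styles at stake: early fusion, where the four MRI modalities are stacked as input channels of one encoder, versus late fusion, where each modality gets its own encoder and the learned representations are concatenated before the segmentation head. Layer sizes are illustrative assumptions.

# Illustrative early-fusion vs. late-fusion segmentation heads.
import torch
import torch.nn as nn

def encoder(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    )

class EarlyFusionSeg(nn.Module):
    """All four MRI modalities (T1, T1c, T2, FLAIR) enter one encoder."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.enc = encoder(4)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):          # x: (B, 4, H, W)
        return self.head(self.enc(x))

class LateFusionSeg(nn.Module):
    """Separate encoder per modality; representations fused before the head."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.encs = nn.ModuleList(encoder(1) for _ in range(4))
        self.head = nn.Conv2d(64 * 4, num_classes, 1)

    def forward(self, x):          # x: (B, 4, H, W)
        feats = [enc(x[:, i:i + 1]) for i, enc in enumerate(self.encs)]
        return self.head(torch.cat(feats, dim=1))

if __name__ == "__main__":
    scan = torch.randn(1, 4, 128, 128)
    print(EarlyFusionSeg()(scan).shape, LateFusionSeg()(scan).shape)

The late-fusion variant is the one the abstract's conclusion favors: separate per-modality representations, combined only for the final segmentation.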
Cross-modal image synthesis is gaining significant interest for its ability to estimate target images of a different modality from a given set of source images (e.g., MR to MR, MR to CT, or CT to PET) without the need for an actual acquisition. Although such methods show potential for applications in radiation therapy planning, image super-resolution, atlas construction, image segmentation, and more, the synthesis results are not as accurate as an actual acquisition. In this paper, we address the problem of multimodal image synthesis by proposing a fully convolutional deep learning architecture called SynNet. We extend the proposed architecture to various input-output configurations. Finally, we propose a structure-preserving custom loss function for cross-modal image synthesis. We validate the proposed SynNet and its extended framework on the BRATS dataset, with comparisons against three state-of-the-art methods, and the results of the proposed custom loss function are validated against the traditional loss function used by state-of-the-art cross-modal image synthesis methods.
https://arxiv.org/abs/1806.11475
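The custom loss is not spelled out in the abstract, so the following is only a plausible reading of "structure preserving," not the paper's published loss: a pixel-wise L1 term plus an image-gradient term that penalizes differences in edge structure. The weighting alpha is an illustrative assumption.

# Plausible sketch of a structure-preserving synthesis loss (assumed form).
import torch
import torch.nn.functional as F

def image_gradients(img):
    """Forward differences along height and width; img is (B, C, H, W)."""
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]
    return dy, dx

def structure_preserving_loss(pred, target, alpha=0.5):
    pixel = F.l1_loss(pred, target)              # pixel-wise fidelity
    pdy, pdx = image_gradients(pred)
    tdy, tdx = image_gradients(target)
    structure = F.l1_loss(pdy, tdy) + F.l1_loss(pdx, tdx)  # edge agreement
    return pixel + alpha * structure

if __name__ == "__main__":
    pred = torch.rand(2, 1, 64, 64, requires_grad=True)  # synthesized slice
    target = torch.rand(2, 1, 64, 64)                    # acquired slice
    loss = structure_preserving_loss(pred, target)
    loss.backward()
    print(loss.item())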