BowNet: Dilated Convolution Neural Network for Ultrasound Tongue Contour Extraction

Abstract

Ultrasound imaging is safe, relatively affordable, and capable of real-time performance. One application of this technology is visualizing and characterizing human tongue shape and motion during real-time speech in order to study healthy or impaired speech production. Because ultrasound images are noisy and low in contrast, non-expert users may find it difficult to recognize organ shapes such as the tongue surface (dorsum). To ease quantitative analysis of tongue shape and motion, the tongue surface can be extracted, tracked, and visualized instead of the whole tongue region. Manually delineating the tongue surface in each frame is a cumbersome, subjective, and error-prone task. Furthermore, the rapidity and complexity of tongue gestures make the task challenging, and manual segmentation is not feasible for real-time applications. With state-of-the-art deep neural network models and training techniques, it is feasible to implement new fully automatic, accurate, and robust segmentation methods that are capable of real-time performance and applicable to tracking tongue contours during speech. This paper presents two novel deep neural network models, BowNet and wBowNet, which benefit from the global prediction ability of encoding-decoding models with integrated multi-scale contextual information and from the full-resolution (local) extraction capability of dilated convolutions. Experimental results on several ultrasound tongue image datasets revealed that combining local and global searching could improve prediction results significantly. Qualitative and quantitative assessment of the BowNet models showed outstanding accuracy and robustness in comparison with similar techniques.
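
The abstract credits dilated convolutions with full-resolution (local) extraction. The following is a minimal sketch of that general property only, not the BowNet or wBowNet architecture: a zero-padded, stride-1 dilated convolution keeps the output the same size as the input while the receptive field grows with the dilation rate. The function names, the 8x8 toy image, and the 3x3 all-ones kernel are illustrative assumptions that do not come from the paper.

```python
def dilated_conv2d(image, kernel, dilation=1):
    """Zero-padded ('same') 2D dilated cross-correlation on lists of lists.

    Assumes a square kernel with odd side length (illustrative helper).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def receptive_field(kernel_size, dilations):
    """Per-axis receptive field of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Toy 8x8 "image" and a 3x3 all-ones kernel (assumed values).
image = [[float(x + y) for x in range(8)] for y in range(8)]
kernel = [[1.0] * 3 for _ in range(3)]
out = dilated_conv2d(image, kernel, dilation=2)

# The output keeps the input resolution; no pooling or striding is involved.
print(len(out), len(out[0]))
# Three 3x3 layers with dilations 1, 2, 4 versus three undilated layers.
print(receptive_field(3, [1, 2, 4]), receptive_field(3, [1, 1, 1]))
```

In this sketch the three dilated layers cover a wider context than three ordinary 3x3 layers without any downsampling, which is the kind of trade-off that motivates pairing dilated convolutions with an encoding-decoding path.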

URL

https://arxiv.org/abs/1906.04232

PDF

https://arxiv.org/pdf/1906.04232.pdf