Paper Reading AI Learner

Deep Learning Based Automatic Video Annotation Tool for Self-Driving Car

2019-04-19 10:48:18
N. S. Manikandan, K. Ganesan

Abstract

In a self-driving car, object detection, object classification, lane detection and object tracking are considered to be the crucial modules. In recent times, there has been growing interest in narrating, from real-time video, the scene captured by a camera fitted in the vehicle. To implement this task effectively, deep learning techniques and automatic video annotation tools are widely used. In the present paper, we compare the techniques available for each module and choose the best algorithm among them using appropriate metrics. For object detection, YOLO and RetinaNet-50 are considered, and the better one is chosen based on mean Average Precision (mAP). For object classification, we consider VGG-19 and ResNet-50 and select the better algorithm based on low error rate and high accuracy. For lane detection, Udacity's 'Finding Lane Line' and the deep learning based LaneNet algorithm are compared, and the one that more accurately identifies the given lane is chosen for implementation. For object tracking, we compare Udacity's 'Object Detection and Tracking' algorithm with the deep learning based Deep SORT algorithm, and select the better one based on how accurately it tracks the same object across many frames and predicts object movement. Our automatic video annotation tool is found to be 83% accurate when compared with a human annotator. We considered a video of 530 frames, each of resolution 1035 x 1800 pixels, with about 15 objects per frame on average. Our annotation tool took 43 minutes on a CPU-based system and 2.58 minutes on a mid-level GPU-based system to run all four modules, whereas the same video took nearly 3060 minutes for a single human annotator to narrate. Thus we claim that our proposed automatic video annotation tool is reasonably fast (about 1200 times faster on a GPU system) and accurate.
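The abstract does not describe the tool's internal interfaces, but the four-module, per-frame flow it outlines can be sketched in a few lines of Python. The outline below is a hypothetical illustration only: the callables detect, classify, find_lanes and track and the FrameAnnotation record are assumed names, standing in for whichever detection (YOLO / RetinaNet-50), classification (VGG-19 / ResNet-50), lane detection (LaneNet or 'Finding Lane Line') and tracking (Deep SORT or Udacity's tracker) models are selected.

    # Minimal sketch of a per-frame annotation loop for the four modules named in
    # the abstract. All names below are hypothetical, not the authors' actual API.
    from dataclasses import dataclass, field
    from typing import Callable, List, Sequence, Tuple

    Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels

    @dataclass
    class FrameAnnotation:
        frame_index: int
        boxes: List[Box] = field(default_factory=list)       # object detection
        labels: List[str] = field(default_factory=list)      # object classification
        track_ids: List[int] = field(default_factory=list)   # object tracking
        lanes: List[Sequence[Tuple[int, int]]] = field(default_factory=list)  # lane detection

    def annotate_video(
        frames: Sequence,                                  # decoded video frames
        detect: Callable[[object], List[Box]],             # e.g. YOLO or RetinaNet-50
        classify: Callable[[object, Box], str],            # e.g. VGG-19 or ResNet-50
        find_lanes: Callable[[object], List],              # e.g. LaneNet
        track: Callable[[object, List[Box]], List[int]],   # e.g. Deep SORT
    ) -> List[FrameAnnotation]:
        """Run all four modules on every frame and collect per-frame annotations,
        from which a textual narration of the scene can then be generated."""
        annotations: List[FrameAnnotation] = []
        for i, frame in enumerate(frames):
            boxes = detect(frame)                          # where are the objects?
            labels = [classify(frame, b) for b in boxes]   # what are they?
            track_ids = track(frame, boxes)                # same object keeps its id across frames
            lanes = find_lanes(frame)                      # where is the lane?
            annotations.append(FrameAnnotation(i, boxes, labels, track_ids, lanes))
        return annotations

For reference, the reported speedup is consistent with the timing figures: 3060 min / 2.58 min ≈ 1186, i.e. roughly 1200 times faster than the human annotator on the GPU system, and 3060 / 43 ≈ 71 times faster on the CPU system.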

URL

https://arxiv.org/abs/1904.12618

PDF

https://arxiv.org/pdf/1904.12618.pdf

