
Lightweight Delivery Detection on Doorbell Cameras

2023-05-13 01:28:28
Pirazh Khorramshahi, Zhe Wu, Tianchen Wang, Luke Deluccia, Hongcheng Wang

Abstract

Despite recent advances in video-based action recognition and robust spatio-temporal modeling, most proposed approaches rely on abundant computational resources to run large, computation-intensive convolutional or transformer-based neural networks and obtain satisfactory results. This limits the deployment of such models on edge devices with limited power and computing resources. In this work, we investigate an important smart home application, video-based delivery detection, and present a simple and lightweight pipeline for this task that can run on resource-constrained doorbell cameras. Our proposed pipeline relies on motion cues to generate a set of coarse activity proposals, which are then classified with a mobile-friendly 3D CNN. For training, we design a novel semi-supervised attention module that helps the network learn robust spatio-temporal features, and we adopt an evidence-based optimization objective that allows for quantifying the uncertainty of the network's predictions. Experimental results on our curated delivery dataset show the effectiveness of our pipeline compared to alternatives and highlight the benefits of our training-phase novelties, which yield considerable inference-time performance gains at no extra cost.
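The abstract names two mechanisms without giving details: motion-based activity proposals and an evidence-based objective that exposes prediction uncertainty. The sketch below illustrates both under stated assumptions and is not the authors' implementation. The function names, the thresholding rule, and the `min_len` parameter are hypothetical, and the uncertainty formula is the common Dirichlet formulation from evidential deep learning (alpha = evidence + 1, uncertainty = K / sum(alpha)), which the paper may or may not use in this exact form.

```python
from typing import List, Sequence, Tuple


def motion_proposals(
    scores: Sequence[float], threshold: float, min_len: int = 2
) -> List[Tuple[int, int]]:
    """Group consecutive frames with motion above `threshold` into spans.

    `scores` holds one motion score per frame (hypothetical input, e.g. the
    mean absolute difference between neighboring frames). Returns inclusive
    (start, end) frame indices; spans shorter than `min_len` are dropped.
    """
    proposals: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new span of motion begins here
        elif start is not None:
            if i - start >= min_len:
                proposals.append((start, i - 1))
            start = None
    # Close a span that runs to the final frame.
    if start is not None and len(scores) - start >= min_len:
        proposals.append((start, len(scores) - 1))
    return proposals


def evidential_uncertainty(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Turn non-negative per-class evidence into probabilities and uncertainty.

    Assumed formulation: alpha_k = e_k + 1, S = sum(alpha),
    p_k = alpha_k / S, and uncertainty u = K / S with K classes.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return probs, uncertainty
```

In this sketch, each span returned by `motion_proposals` would be cropped from the video and passed to the classifier, and a proposal whose uncertainty is close to 1 (almost no evidence for any class) could be discarded rather than reported as a delivery.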


URL

https://arxiv.org/abs/2305.07812

PDF

https://arxiv.org/pdf/2305.07812.pdf

