Paper Reading AI Learner

Fine-grained activity recognition for assembly videos

2020-12-02 18:38:17
Jonathan D. Jones, Cathryn Cortesa, Amy Shelton, Barbara Landau, Sanjeev Khudanpur, Gregory D. Hager

Abstract

In this paper we address the task of recognizing assembly actions as a structure (e.g. a piece of furniture or a toy block tower) is built up from a set of primitive objects. Recognizing the full range of assembly actions requires perception at a level of spatial detail that has not been attempted in the action recognition literature to date. We extend the fine-grained activity recognition setting to address the task of assembly action recognition in its full generality by unifying assembly actions and kinematic structures within a single framework. We use this framework to develop a general method for recognizing assembly actions from observation sequences, along with observation features that take advantage of a spatial assembly's special structure. Finally, we evaluate our method empirically on two application-driven data sources: (1) An IKEA furniture-assembly dataset, and (2) A block-building dataset. On the first, our system recognizes assembly actions with an average framewise accuracy of 70% and an average normalized edit distance of 10%. On the second, which requires fine-grained geometric reasoning to distinguish between assemblies, our system attains an average normalized edit distance of 23% -- a relative improvement of 69% over prior work.

Abstract (translated)

URL

https://arxiv.org/abs/2012.01392

PDF

https://arxiv.org/pdf/2012.01392.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot