Abstract
Event camera is an asynchronous, high frequencyvision sensor with low power consumption, which is suitable forhuman action understanding task. It is vital to encode the spatial-temporal information of event data properly and use standardcomputer vision tool to learn from the data. In this work, wepropose a timestamp image encoding 2D network, which takes theencoded spatial-temporal images with polarity information of theevent data as input and output the action label. In addition, wepropose a future timestamp image generator to generate futureaction information to aid the model to anticipate the humanaction when the action is not completed. Experiment results showthat our method can achieve the same level of performance asthose RGB-based benchmarks on real world action recognition,and also achieve the state of the art (SOTA) result on gesturerecognition. Our future timestamp image generating model caneffectively improve the prediction accuracy when the action is notcompleted. We also provide insight discussion on the importanceof motion and appearance information in action recognition andanticipation.
Abstract (translated)
URL
https://arxiv.org/abs/2104.05145