Deep Learning-based Action Recognition using Skeleton Joints Mapping

  • Tasnim, Nusrat (School of Electronics and Information Engineering, Korea Aerospace University) ;
  • Baek, Joong-Hwan (School of Electronics and Information Engineering, Korea Aerospace University)
  • Received : 2020.03.24
  • Accepted : 2020.04.16
  • Published : 2020.04.30

Abstract

Recently, with the development of computer vision and deep learning technology, research on human action recognition has been actively conducted for video analysis, video surveillance, interactive multimedia, and human-machine interaction applications. Many researchers have introduced diverse techniques for understanding and classifying human actions using RGB images, depth images, skeleton data, and inertial data. However, skeleton-based action recognition remains a challenging research topic for human-machine interaction. In this paper, we propose an end-to-end mapping of skeleton joints that converts an action sequence into a spatio-temporal image, the so-called dynamic image. An efficient deep convolutional neural network is then devised to classify the actions. We use the publicly accessible UTD-MHAD skeleton dataset to evaluate the performance of the proposed method. Experimental results show that the proposed system outperforms existing methods, achieving a high accuracy of 97.45%.
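The abstract does not spell out the joint-mapping procedure itself, so the following is only a minimal sketch of one common skeleton-to-image encoding: joint coordinates are min-max normalized and arranged as a joints-by-time pseudo-color image, with the x, y, and z axes filling the three channels. The function name, the normalization scheme, and the nearest-neighbor resizing are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def skeleton_to_dynamic_image(sequence, out_size=(224, 224)):
    """Encode a skeleton sequence of shape (frames, joints, 3) as an image.

    Rows index joints, columns index time, and the x/y/z coordinate axes
    fill the three color channels, so the motion of every joint over the
    whole action is summarized in one spatio-temporal ("dynamic") image.
    This is an illustrative encoding, not the paper's exact mapping.
    """
    seq = np.asarray(sequence, dtype=np.float32)          # (T, J, 3)
    # Min-max normalize each coordinate axis into the [0, 255] pixel range.
    mins = seq.min(axis=(0, 1), keepdims=True)
    maxs = seq.max(axis=(0, 1), keepdims=True)
    norm = (seq - mins) / np.maximum(maxs - mins, 1e-6) * 255.0
    img = norm.transpose(1, 0, 2)                         # (J, T, 3)
    # Stretch to a fixed CNN input size with nearest-neighbor sampling.
    rows = np.linspace(0, img.shape[0] - 1, out_size[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, out_size[1]).astype(int)
    return img[np.ix_(rows, cols)].astype(np.uint8)
```

For a Kinect skeleton with 20 joints tracked over, say, 60 frames (as in UTD-MHAD), this yields a (224, 224, 3) uint8 image ready to feed a CNN.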

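The classification stage can likewise be sketched as a compact convolutional network over the generated dynamic images. The architecture below is a minimal stand-in, not the network devised in the paper; only the number of action classes (UTD-MHAD defines 27) comes from the dataset.

```python
import torch
import torch.nn as nn

class ActionCNN(nn.Module):
    """Minimal CNN over dynamic images; a stand-in for the paper's network."""

    def __init__(self, num_classes=27):        # UTD-MHAD defines 27 actions
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global average pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                       # x: (N, 3, 224, 224)
        h = self.features(x).flatten(1)         # (N, 128)
        return self.classifier(h)               # unnormalized class logits
```

Training such a model would pair the logits with a cross-entropy loss (torch.nn.CrossEntropyLoss) over the dynamic images produced by the mapping step above.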

References

  1. Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Boston: MA, pp. 1110-1118, 2015.
  2. X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus: OH, pp. 804-811, 2014.
  3. V. S. Kulkarni, and S. D. Lokhande, "Appearance based recognition of American Sign Language using gesture segmentation," International Journal on Computer Science and Engineering, No. 3, pp. 560-565, 2010.
  4. P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Breckenridge: CO, pp. 65-72, 2005.
  5. D. Wu, and L. Shao, "Silhouette analysis-based action recognition via exploiting human poses," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 23, No. 2, pp. 236-243, 2012. https://doi.org/10.1109/TCSVT.2012.2203731
  6. M. Ahmad, and S. W. Lee, "HMM-based human action recognition using multiview image sequences," in 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, pp. 263-266, 2006.
  7. L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3d joints," in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence: RI, pp. 20-27, 2012.
  8. J. Luo, W. Wang, and H. Qi, "Spatio-temporal feature extraction and representation for RGB-D human action recognition," Pattern Recognition Letters, Vol. 50, pp. 139-148, 2014. https://doi.org/10.1016/j.patrec.2014.03.024
  9. V. Megavannan, B. Agarwal, and R. V. Babu, "Human action recognition using depth maps," in 2012 International Conference on Signal Processing and Communications (SPCOM), Bangalore: India, pp. 1-5, 2012.
  10. J. Trelinski, and B. Kwolek, "Convolutional neural network-based action recognition on depth maps," in International Conference on Computer Vision and Graphics, Warsaw: Poland, pp. 209-221, 2018.
  11. P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, "Action recognition from depth maps using deep convolutional neural networks," IEEE Transactions on Human-Machine Systems, Vol. 46, No. 4, pp. 498-509, 2015. https://doi.org/10.1109/THMS.2015.2504550
  12. K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, Montreal: Canada, pp. 568-576, 2014.
  13. C. Li, Q. Zhong, D. Xie, and S. Pu, "Skeleton-based action recognition with convolutional neural networks," in 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, pp. 597-600, 2017.
  14. M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, "Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations," in Twenty-Third International Joint Conference on Artificial Intelligence, Beijing: China, pp. 2466-2472, 2013.
  15. Y. Du, Y. Fu, and L. Wang, "Skeleton based action recognition with convolutional neural network," in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur: Malaysia, pp. 579-583, 2015.
  16. P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam: Netherlands, pp. 102-106, 2016.
  17. Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 28, No. 3, pp. 807-811, 2016. https://doi.org/10.1109/tcsvt.2016.2628339
  18. C. Li, Y. Hou, P. Wang, and W. Li, "Joint distance maps-based action recognition with convolutional neural networks," IEEE Signal Processing Letters, Vol. 24, No. 5, pp. 624-628, 2017. https://doi.org/10.1109/LSP.2017.2678539
  19. J. Imran, and B. Raman, "Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition," Journal of Ambient Intelligence and Humanized Computing, pp. 1-20, 2019.
  20. UTD-MHAD skeleton dataset, University of Texas at Dallas, [Internet]. Available: https://personal.utdallas.edu/~kehtar/UTD-MHAD.html
  21. C. Shorten, and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, Vol. 6, No. 1, Article 60, 2019. https://doi.org/10.1186/s40537-019-0197-0
  22. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, Lake Tahoe: NV, pp. 1097-1105, 2012.