Preprocessing Technique for Improving Action Recognition Performance in ERP Video with Multiple Objects

  • Park, Eun-Soo (Department of Computer Education, Sungkyunkwan University) ;
  • Kim, Seunghwan (Department of Computer Education, Sungkyunkwan University) ;
  • Ryu, Eun-Seok (Department of Computer Education, Sungkyunkwan University)
  • Received : 2019.12.26
  • Accepted : 2020.04.16
  • Published : 2020.05.30

Abstract

In this paper, we propose a preprocessing technique that addresses the problems of performing action recognition on Equirectangular Projection (ERP) video. The proposed technique assumes that the person object is the subject of the action, i.e., the Object of Interest (OOI), and that the region surrounding the OOI is the Region of Interest (ROI). The technique consists of three modules: I) person objects are detected in the frame with an object recognition model; II) a saliency map is generated from the input frame; III) the subject of the action is selected using the detected person objects and the saliency map. The bounding box of the selected action subject is then fed to the action recognition model to improve action recognition performance. Compared with feeding the original ERP video to the action recognition model, the proposed preprocessing improves performance by up to 99.6%. In addition, extracting only the frames in which an OOI is detected also provides the effect of an action-related video summary.

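To make the three-module pipeline described in the abstract concrete, the following is a minimal sketch of how such a preprocessing step could be wired together in Python. The `detector.detect_persons` and `saliency_model.predict` interfaces, the mean-saliency selection rule, and the `margin` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def select_action_subject(person_boxes, saliency_map: np.ndarray):
    """Module III: pick the person box with the highest mean saliency
    (an assumed selection rule) as the Object of Interest (OOI)."""
    best_box, best_score = None, -1.0
    for (x1, y1, x2, y2) in person_boxes:      # integer pixel coordinates
        region = saliency_map[y1:y2, x1:x2]
        score = float(region.mean()) if region.size else 0.0
        if score > best_score:
            best_box, best_score = (x1, y1, x2, y2), score
    return best_box


def preprocess_frame(frame: np.ndarray, detector, saliency_model, margin=0.1):
    """Run the three modules on one ERP frame and return the ROI crop
    (the OOI box expanded by `margin`), or None when no person is found."""
    person_boxes = detector.detect_persons(frame)   # Module I  (assumed API)
    saliency_map = saliency_model.predict(frame)    # Module II (assumed API,
                                                    #  same resolution as frame)
    ooi = select_action_subject(person_boxes, saliency_map)  # Module III
    if ooi is None:
        return None
    x1, y1, x2, y2 = ooi
    h, w = frame.shape[:2]
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(w, x2 + dx), min(h, y2 + dy)
    return frame[y1:y2, x1:x2]  # ROI crop fed to the action recognition model
```

In this sketch, frames for which `preprocess_frame` returns `None` can simply be skipped, which corresponds to the abstract's observation that keeping only the frames in which an OOI is detected acts as an action-related video summary.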
