Teacher-Student Architecture Based CNN for Action Recognition

  • Hyo Jong Lee (Division of Computer Science and Engineering, Jeonbuk National University)
  • Received : 2021.12.23
  • Accepted : 2022.01.24
  • Published : 2022.03.31

Abstract

Convolutional neural networks (CNNs) for action recognition generally use a two-stream architecture consisting of an RGB stream and an optical flow stream. The RGB frame stream captures appearance, while the optical flow stream interprets motion. However, the standard method of computing optical flow is expensive, and the added computation increases the latency of action recognition. The purpose of this study was to evaluate a novel design that builds two sub-networks within one neural network: an optical flow sub-network assigned as the teacher and an RGB frame sub-network as the student. In the training stage, the teacher sub-network extracts optical flow features and transmits them to the student sub-network as a baseline for its training. In the test stage, only the student sub-network operates, reducing latency because no optical flow is computed. Experimental results show that our network, fed only by the RGB stream, achieves a competitive accuracy of 54.5% on HMDB51, which is 1.5 times better than R3D-18.

Most state-of-the-art action recognition convolutional networks are based on a two-stream architecture with an RGB stream and an optical flow stream. The RGB frame stream represents appearance features, and the optical flow stream interprets motion features. However, because optical flow is very expensive to compute, it introduces latency into action recognition. Inspired by two-stream networks and the teacher-student architecture, we developed a new network design for action recognition. The proposed network consists of two sub-networks: an optical flow sub-network serving as the teacher is connected to an RGB frame sub-network serving as the student. In the training stage, optical flow features are extracted to train the teacher sub-network, and those features are then transmitted to the student sub-network as the baseline for training it. In the test stage, only the student network is used, so latency is reduced because no optical flow is computed. Experiments confirmed that the proposed network achieves higher accuracy than a standard two-stream architecture.
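As a rough sketch of the training scheme described above (a minimal illustration under stated assumptions, not the authors' released implementation), the following PyTorch code distills features from a pre-trained optical-flow teacher into an RGB student. The R3D-18 backbone, the choice of penultimate features, the loss weight alpha, and the helper names features and student_step are all assumptions made for illustration.

```python
# Hypothetical sketch of the teacher-student training described in the abstract.
# The R3D-18 backbone, feature layer, and loss weighting are assumptions, not
# the authors' published implementation.
import torch
import torch.nn as nn
import torchvision.models.video as video_models

NUM_CLASSES = 51  # HMDB51 has 51 action classes

# Teacher consumes optical-flow clips (2 channels: u, v); student consumes RGB.
teacher = video_models.r3d_18(num_classes=NUM_CLASSES)
teacher.stem[0] = nn.Conv3d(2, 64, kernel_size=(3, 7, 7),
                            stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
student = video_models.r3d_18(num_classes=NUM_CLASSES)
teacher.eval()  # assume the teacher was already trained on optical flow

ce = nn.CrossEntropyLoss()  # supervised loss on the student's predictions
mse = nn.MSELoss()          # feature-matching (distillation) loss

def features(model, clip):
    """Penultimate features of torchvision's R3D-18 (everything before fc)."""
    x = model.stem(clip)
    x = model.layer1(x)
    x = model.layer2(x)
    x = model.layer3(x)
    x = model.layer4(x)
    return model.avgpool(x).flatten(1)

def student_step(rgb_clip, flow_clip, labels, optimizer, alpha=0.5):
    """One training step: the frozen teacher's flow features serve as the
    baseline that the student's RGB features are pulled toward."""
    with torch.no_grad():
        t_feat = features(teacher, flow_clip)
    s_feat = features(student, rgb_clip)
    logits = student.fc(s_feat)
    loss = ce(logits, labels) + alpha * mse(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time only the student runs, so no optical flow is computed:
#   predictions = student(rgb_clip).argmax(dim=1)
```

Because the teacher is consulted only during training, inference cost is exactly that of the single RGB sub-network, which is where the latency reduction claimed above comes from.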

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (GR 2019R1D1A3A03103736), and in part by a project for Cooperative R&D between Industry, Academy, and Research Institute, funded by the Korea Ministry of SMEs and Startups in 20 (Grant No. S3114049).

References

  1. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the Computer Vision and Pattern Recognition, pp.1-9, 2015.
  2. J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, "CNN-RNN: A unified framework for multi-label image classification," in Proceedings of the Computer Vision and Pattern Recognition, pp.2285-2294, 2016.
  3. K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proceedings of the Neural Information Processing, pp.568-576, 2014.
  4. C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the Computer Vision and Pattern Recognition, pp.1933-1941, 2016.
  5. A. Diba, A. Pazandeh, and L. Van Gool, "Efficient two-stream motion and appearance 3D CNNs for video classification," arXiv preprint arXiv:1608.08851, 2016.
  6. G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning Workshop, 2014.
  7. S. Kong, T. Guo, S. You, and C. Xu, "Learning student networks with few data," in Proceedings of the AAAI Conference on Artificial Intelligence, Vol.34, No.4, pp.4469-4476, 2020.
  8. P. Bashivan, M. Tensen, and J. J. DiCarlo, "Teacher guided architecture search," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.5320-5329, 2019.
  9. D. Shah, V. Trivedi, V. Sheth, A. Shah, and U. Chauhan, "ResTS: Residual deep interpretable architecture for plant disease detection," Information Processing in Agriculture, https://doi.org/10.1016/j.inpa.2021.06.001.
  10. C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime TV-L1 optical flow," in DAGM 2007: Pattern Recognition, Vol.4713, pp.214-223, 2007.
  11. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp.2556-2563, 2011.
  12. K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
  13. S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, "Optical flow guided feature: A fast and robust motion representation for video action recognition," in Proceedings of the Computer Vision and Pattern Recognition, pp.1-9, 2018.
  14. Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann, "Hidden two-stream convolutional networks for action recognition," arXiv preprint arXiv:1704.00389, 2017.
  15. J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis, "ActionFlowNet: Learning motion representation for action recognition," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp.1616-1624, 2018.
  16. Y. Zhao and H. Lee, "FTSnet: A simple convolutional neural networks for action recognition," in Proceedings of the Annual Conference of KIPS (ACK) 2021, pp.878-879, 2021.
  17. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the Computer Vision and Pattern Recognition, pp.770-778, 2016.
  18. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim, "Robust facial landmark detection via recurrent attentive-refinement networks," in Proceedings of the European Conference on Computer Vision (ECCV), pp.57-72, 2016.
  19. Z. Wang, Q. She, and A. Smolic, "ACTION-Net: Multipath excitation for action recognition," in Proceedings of the Computer Vision and Pattern Recognition, pp.13214-13223, 2021.
  20. L. Wang, Z. Tong, B. Ji, and G. Wu, "TDN: Temporal difference networks for efficient action recognition," in Proceedings of the Computer Vision and Pattern Recognition, pp.1895-1904, 2021.
  21. T. Hui, X. Tang, and C. C. Loy, "A lightweight optical flow CNN - Revisiting data fidelity and regularization," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.43, No.8, pp.2555-2569, 2021. https://doi.org/10.1109/TPAMI.2020.2976928
  22. K. Luo, C. Wang, S. Liu, H. Fan, J. Wang, and J. Sun, "UPFlow: Upsampling pyramid for unsupervised optical flow learning," in Proceedings of the Computer Vision and Pattern Recognition, pp.1045-1054, 2021.