Fine-tuning Neural Network for Improving Video Classification Performance Using Vision Transformer

  • Received : 2023.09.18
  • Accepted : 2023.09.26
  • Published : 2023.09.30

Abstract

This paper proposes a neural network that applies fine-tuning to improve the performance of Vision Transformer-based video classification. The need for real-time, deep-learning-based video analysis has grown in recent years, but the CNN models conventionally used for image classification struggle to capture the relationships between consecutive frames. To address this, we compare and analyze two models built on the Attention mechanism, the Vision Transformer and the Non-local neural network, to identify the better-suited one. We further propose an optimal fine-tuned network by applying several fine-tuning strategies as a form of transfer learning. In the experiments, the model was trained on the UCF101 dataset, and its performance was then verified by transferring it to the UTA-RLDD dataset.
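The transfer-learning setup the abstract describes, pre-training a transformer-based video classifier on UCF101 and then fine-tuning it on UTA-RLDD, can be sketched in PyTorch. The model below is a deliberately simplified stand-in, not the paper's actual architecture: frame features are assumed to be pre-embedded tokens, the backbone is a small `nn.TransformerEncoder`, and all dimensions and class counts are illustrative.

```python
import torch
import torch.nn as nn

class VideoTransformerClassifier(nn.Module):
    # Minimal stand-in for a Vision-Transformer-style video classifier:
    # each frame is assumed pre-embedded as a d_model-dim token, a
    # transformer encoder attends across frames, and a linear head
    # classifies the clip.
    def __init__(self, d_model=64, n_classes=101):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                 # x: (batch, n_frames, d_model)
        x = self.backbone(x)              # attention across frames
        return self.head(x.mean(dim=1))   # pool over frames, classify

# "Pre-trained" source model (e.g. on UCF101's 101 action classes).
model = VideoTransformerClassifier(n_classes=101)

# One common fine-tuning strategy: freeze the backbone and swap in a
# new head sized for the target task (UTA-RLDD's 3 drowsiness levels).
for p in model.backbone.parameters():
    p.requires_grad = False
model.head = nn.Linear(64, 3)

# The optimizer then only updates the new head's parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

out = model(torch.randn(2, 8, 64))        # 2 clips of 8 frame tokens
print(out.shape)                          # torch.Size([2, 3])
```

Other fine-tuning variants the paper's comparison implies, such as unfreezing the last encoder layers or training the full network at a lower learning rate, differ only in which parameter groups are left with `requires_grad=True`.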

Acknowledgement

This work was supported by Seokyeong University in 2022 and 2023.
