
Detection of video editing points using facial keypoints


  • Joshep Na (Department of AI, Big Data & Management, Kookmin University) ;
  • Jinho Kim (Department of AI, Big Data & Management, Kookmin University) ;
  • Jonghyuk Park (College of Business Administration, Kookmin University)
  • Received : 2023.09.05
  • Accepted : 2023.11.16
  • Published : 2023.12.31

Abstract

Recently, various services using artificial intelligence (AI) have been emerging in the media field as well. However, most video editing, which involves finding editing points and splicing footage together, is still carried out manually, consuming considerable time and human resources. Therefore, this study proposes a methodology that detects the editing points of a video according to whether the person on screen is speaking, using a Video Swin Transformer. The proposed structure first detects facial keypoints through face alignment, so that the temporal and spatial changes of the face are reflected from the input video data. Then, the Video Swin Transformer-based model proposed in this study classifies the behavior of the person in the video. Specifically, the feature map generated by the Video Swin Transformer from the video data is combined with the facial keypoints detected through face alignment, and utterance is classified through convolution layers. In conclusion, the video editing point detection model using facial keypoints proposed in this paper achieved a classification accuracy of 89.17%, an improvement over the 87.46% recorded by the model without facial keypoints.
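The fusion step described in the abstract can be sketched as follows. This is a minimal illustrative head, not the paper's exact configuration: the channel sizes, the use of keypoint heatmaps as the fusion representation, and the pooling/classifier layout are all assumptions. It only shows the general idea of concatenating a Video Swin Transformer feature map with facial-keypoint features and classifying speaking vs. not-speaking through convolution layers.

```python
import torch
import torch.nn as nn

class UtteranceHead(nn.Module):
    """Hypothetical fusion head: concatenates a (pooled) Video Swin
    Transformer feature map with facial-keypoint heatmaps along the
    channel axis, then classifies utterance (speaking / not speaking)
    with convolution layers. All shapes are illustrative assumptions."""

    def __init__(self, swin_channels=768, keypoint_channels=68, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            # Convolution over the concatenated channels
            nn.Conv2d(swin_channels + keypoint_channels, 256,
                      kernel_size=3, padding=1),
            nn.ReLU(),
            # Global average pooling to a single spatial position
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, swin_feat, keypoint_maps):
        # swin_feat:      (B, 768, H, W) backbone feature map
        # keypoint_maps:  (B,  68, H, W) one heatmap per facial keypoint
        x = torch.cat([swin_feat, keypoint_maps], dim=1)
        x = self.conv(x).flatten(1)
        return self.fc(x)  # (B, num_classes) utterance logits

head = UtteranceHead()
logits = head(torch.randn(2, 768, 7, 7), torch.randn(2, 68, 7, 7))
print(logits.shape)  # torch.Size([2, 2])
```

In a full pipeline, the backbone feature map would come from a pretrained Video Swin Transformer and the keypoints from a face-alignment detector; the head above only covers the combination and classification stage.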


