Video Analysis System for Action and Emotion Detection by Object with Hierarchical Clustering based Re-ID

  • Lee, Sang-Hyun (Department of Software and Computer Engineering, Ajou University) ;
  • Yang, Seong-Hun (Department of Convergence Software, Myongji University) ;
  • Oh, Seung-Jin (Department of Medical Information Technology Engineering, Soonchunhyang University) ;
  • Kang, Jinbeom (Xinapse)
  • Received : 2021.11.25
  • Accepted : 2022.01.14
  • Published : 2022.03.31

Abstract

Recently, the amount of video data collected from smartphones, CCTVs, black boxes, and high-definition cameras has increased rapidly, and with it the need to analyze and utilize that data. Because many industries lack the skilled manpower to analyze videos, machine learning and artificial intelligence are actively used to assist. Demand for computer vision technologies such as object detection and tracking, action detection, emotion detection, and re-identification (Re-ID) has therefore also grown rapidly. However, object detection and tracking faces many performance-degrading difficulties, such as occlusion and objects re-appearing after leaving the recording location. Action and emotion detection models built on top of object detection and tracking consequently struggle to extract data for each object. In addition, deep learning architectures composed of multiple models suffer performance degradation from bottlenecks and a lack of optimization. In this study, we propose a video analysis system consisting of a YOLOv5-based DeepSORT object tracking model, a SlowFast-based action recognition model, a Torchreid-based Re-ID model, and AWS Rekognition, an emotion recognition service. The proposed system uses Re-ID based on single-linkage hierarchical clustering together with processing methods that maximize hardware throughput. It achieves higher accuracy than a re-identification model using simple metrics, offers near-real-time processing performance, and prevents tracking failures caused by objects leaving and re-entering the scene, occlusion, and so on. By continuously linking each object's action and facial emotion detection results to the same identity, videos can be analyzed efficiently.
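The single-linkage Re-ID step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of cosine distance, and the threshold value are assumptions for the sketch. The key idea is single linkage, where the distance from a new detection to a past identity is the minimum distance to any stored feature of that identity.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between vector a and each row of matrix b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - b @ a

def reassign_id(new_feature, galleries, threshold=0.3):
    """Single-linkage assignment: a new detection matches a past identity
    if its MINIMUM distance to any stored feature of that identity is
    below the threshold; otherwise a new identity is created."""
    best_id, best_dist = None, threshold
    for obj_id, feats in galleries.items():
        d = cosine_distance(new_feature, np.stack(feats)).min()
        if d < best_dist:
            best_id, best_dist = obj_id, d
    if best_id is None:                          # no gallery close enough
        best_id = max(galleries, default=-1) + 1  # open a new identity
        galleries[best_id] = []
    galleries[best_id].append(new_feature)
    return best_id
```

In the full system the feature vectors would come from the Torchreid extractor applied to each tracked bounding box; here they are plain arrays so the clustering logic stands alone.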
The re-identification model extracts a feature vector from the bounding box of each object detected by the tracking model in every frame, and applies single-linkage hierarchical clustering to the feature vectors accumulated from past frames to identify objects whose tracks were lost. Through this process, an object that re-appears after leaving the scene, or that was hidden by occlusion, can be re-tracked, so the action and facial emotion detection results of an object newly recognized after a tracking failure can be linked to those of the object that appeared in the past. To improve processing performance, we introduce a per-object Bounding Box Queue and a Feature Queue method that reduce RAM requirements while maximizing GPU memory throughput. We also introduce the IoF (Intersection over Face) algorithm, which links facial emotions recognized through AWS Rekognition to object tracking information. The academic significance of this study is that the proposed processing techniques allow a two-stage re-identification model to achieve real-time performance, even in a costly pipeline that also performs action and facial emotion detection, without trading accuracy for the speed of simple metrics. The practical implication is that industries which require action and facial emotion detection but struggle with object tracking failures can analyze videos effectively with the proposed model. With its high re-tracking accuracy and processing performance, the model can be used in fields such as intelligent monitoring, observation services, and behavioral or psychological analysis services, where integrating tracking information with extracted metadata creates great industrial and business value.
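The IoF linking step might look like the following sketch. The paper defines only the name Intersection over Face; the exact formula, the box format, and the 0.9 threshold below are our illustrative assumptions: the score is the fraction of the face box that lies inside a tracked object's box, so a face fully contained in a person's bounding box scores 1.0.

```python
def iof(face_box, obj_box):
    """Intersection over Face: the fraction of the face box's area that
    lies inside a tracked object's bounding box. Boxes are (x1, y1, x2, y2)."""
    ix1 = max(face_box[0], obj_box[0])
    iy1 = max(face_box[1], obj_box[1])
    ix2 = min(face_box[2], obj_box[2])
    iy2 = min(face_box[3], obj_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    face_area = (face_box[2] - face_box[0]) * (face_box[3] - face_box[1])
    return inter / face_area if face_area > 0 else 0.0

def link_emotion(face_box, tracked, threshold=0.9):
    """Attach a recognized face to the tracked object whose box contains
    most of it; returns that object's id, or None if no box qualifies."""
    best_id, best = None, threshold
    for obj_id, obj_box in tracked.items():
        score = iof(face_box, obj_box)
        if score >= best:
            best_id, best = obj_id, score
    return best_id
```

Normalizing by the face area rather than the union (as plain IoU would) is what makes the measure suitable here: a face is always much smaller than the person box containing it, so IoU would be near zero even for a correct match.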
In the future, to measure object tracking performance more precisely, experiments should be conducted on the MOT Challenge dataset, which is widely used at international conferences. We will also investigate the cases that the IoF algorithm cannot handle in order to develop a complementary algorithm, and we plan additional research applying this model to datasets from various fields related to intelligent video analysis.

References

  1. Abadi M, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467, 2016.
  2. Amazon, "Amazon Rekognition," AWS, Available at https://aws.amazon.com/rekognition/ (Accessed Sep, 2021).
  3. An L, S. Yang, and B. Bhanu, "Person re-identification by robust canonical correlation analysis," IEEE Signal Processing Letters, Vol.22, No.8 (2015), 1103~1107. https://doi.org/10.1109/LSP.2015.2390222
  4. An L., M. Kafai, S. Yang, and B. Bhanu, "Person re-identification with reference descriptor," IEEE Transactions on Circuits and Systems for Video Technology, Vol.26, No.4 (2016), 776~787. https://doi.org/10.1109/TCSVT.2015.2416561
  5. An, L., X. Chen, S. Liu, Y. Lei, and S. Yang, "Integrating appearance features and soft biometrics for person re-identification," Multimedia Tools and Applications: An International Journal, Vol.76, No.9 (2017), 12117~12131. https://doi.org/10.1007/s11042-016-4070-2
  6. Azizan I., and F. Khalid, "Facial Emotion Recognition: A Brief Review", International Conference on Sustainable Engineering, Technology and Management(ICSETM), 2018.
  7. Bartlett M. S., G. Littlewort, E. Vural, K. Lee, M. Cetin, A. Ercil, and J. Movellan, "Data mining spontaneous facial behavior with automatic expression coding," Lecture Notes in Computer Science, Vol.5042, 2008, 1~20.
  8. Bashir M., E. A. Rundensteiner, and R. Ahsan, "A deep learning approach to trespassing detection using video surveillance data," IEEE International Conference on Big Data, 2019, 3535~3544.
  9. Bewley A., Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," IEEE International Conference on Image Processing(ICIP), 2016, 3464~3468.
  10. Bochkovskiy A., C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv:2004.10934 [cs.CV], 2020.
  11. Brostrom M., "Real-time multi-object tracker using YOLOv5 and deep sort," Github, 2021, Available at https://github.com/mikel-brostrom/Yolov5_DeepSort_Pytorch/ (Accessed Sep, 2021).
  12. Chen T., M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv:1512.01274, 2015.
  13. Ciaparrone G., F. L. Sanchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, "Deep Learning in Video Multi-Object Tracking: A Survey," arXiv:1907.12740, 2019.
  14. Feichtenhofer C., H. Fan, J. Malik, and K. He, "SlowFast Networks for Video Recognition", IEEE/CVF International Conference on Computer Vision(ICCV), 2019, 6201~6210.
  15. Glenn J. et al., "ultralytics/yolov5: v6.0 - YOLOv5n 'Nano' models, Roboflow integration, TensorFlow export, OpenCV DNN support (v6.0)," Zenodo, 2021, Available at https://doi.org/10.5281/zenodo.5563715/ (Accessed Sep, 2021).
  16. Gudelj D., A. F. Stama, J. Petrović and P. Pale, "Visual Object Detection - an Overview of Algorithms and Results," 44th International Convention on Information, Communication and Electronic Technology (MIPRO), 2021, 1727~1732.
  17. Herath S., M. Harandi, and F. Porikli, "Going Deeper into action recognition: A survey", Image and Vision Computing, Vol.60, No.4 (2017), 4~21. https://doi.org/10.1016/j.imavis.2017.01.010
  18. Jiao L. et al., "A Survey of Deep Learning-Based Object Detection," IEEE Access, Vol.7 (2019), 128837~128868.
  19. Jang, S.-I., and C.-S. Park, "Object Tracking Based on Exactly Reweighted Online Total-Error-Rate Minimization," Journal of Intelligence and Information Systems, Vol.25, No.4 (2019), 53~65. https://doi.org/10.13088/JIIS.2019.25.4.053
  20. Jia Y., E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
  21. Ko J. G., Y. S. Bae, J.Y. Park, and K. Park, "Technologies Trends in Image Big Data Analysis," Electronics and Telecommunications Research Institute(ETRI), Vol.29, No.4 (2014), 21~29.
  22. KT, "What Is the Real-Life Parenting Secret That Won Over a Mom with Eight Years of Experience as a Daycare Teacher?! [Home Tutor] Ep.4," Youtube, Jan. 6, 2021, Available at https://www.youtube.com/watch?v=jS2e8iKAqP4 (Accessed Sep, 2021).
  23. Kuo C. -H., S. Khamis, and V. Shet "Person re-identification using semantic color names and rankboost," IEEE Workshop on applications of computer vision(WACV), 2013, 281~287.
  24. Kviatkovsky I, A. Adam, and E. Rivlin, "Color invariants for person reidentification," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.35, No.7 (2013), 1622~1634. https://doi.org/10.1109/TPAMI.2012.246
  25. Lee H. G., M. K. Choi, D. H. Lee, and S. C. Lee, "Intelligent Diagnosis Assistant System of Capsule Endoscopy Video Through Analysis of Video Frames", Journal of Intelligence and Information Systems, Vol. 15, No. 2 (2009), 33~48.
  26. Liao S, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2015, 2197~2206.
  27. Moolchandani M., S. Dwivedi, S. Nigam and K. Gupta, "A survey on: Facial Emotion Recognition and Classification," 5th International Conference on Computing Methodologies and Communication(ICCMC), 2021, 1677~1686.
  28. Moon J. Y., H. I. Kim, and J. Y. Park, "Trends in Temporal Action Detection in Untrimmed Videos," Electronics and Telecommunications Research Institute(ETRI), Vol.35, No.3 (2020), 20~33.
  29. Paszke A., S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in pytorch," NIPS, 2017.
  30. Pedagadi S, J. Orwell, S. Velastin, and B. Boghossian, "Local fisher discriminant analysis for pedestrian re-identification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, 3318~3325.
  31. samihormi, "Multi-Camera-Person-Tracking-and-Re-Identification," Github, 2020, Available at https://github.com/samihormi/Multi-Camera-Person-Tracking-and-Re-Identification (Accessed Jul, 2021).
  32. Shin, D.-W., T.-H. Kim, and J.-M. Choi, "Video Scene Detection using Shot Clustering based on Visual Features", Journal of Intelligence and Information Systems, Vol. 18, No. 2 (2012), 47~60. https://doi.org/10.13088/JIIS.2012.18.2.047
  33. Singh B., "DETECTING OBJECTS AND ACTIONS WITH DEEP LEARNING," (PhD Thesis), University of Maryland, College Park, 2018.
  34. Tang J., J. Xia, X. Mu, B. Pang, and C. Lu, "Asynchronous Interaction Aggregation for Action Detection," IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2020.
  35. Wang Y., T. Bao, C. Ding, and M. Zhu, "Face recognition in real-world surveillance videos with deep learning method," 2nd International Conference on Image, Vision and Computing (ICIVC), 2017, 239~243.
  36. Wojke N., A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," IEEE International Conference on Image Processing(ICIP), 2017, 3645~3649.
  37. Wright C. et al., "AI IN PRODUCTION: VIDEO ANALYSIS AND MACHINE LEARNING FOR EXPANDED LIVE EVENTS COVERAGE," SMPTE Motion Imaging Journal, vol.129, No.2 (2020), 36~45. https://doi.org/10.5594/JMI.2020.2967204
  38. Ye M., J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. H. Hoi, "Deep Learning for Person Re-identification: A Survey and Outlook," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  39. Younis H., M. H. Bhatti, and M. Azeem, "Classification of Skin Cancer Dermoscopy Images using Transfer Learning," 15th International Conference on Emerging Technologies(ICET), 2019, 1~4.
  40. Zhang Y., C. Wang, X. Wang, W. Zeng, and W. Liu, "FairMOT: On the Fairness of Detection and Re-identification Object Tracking", arXiv:2004.01888, 2020.
  41. Zheng W., S. Gong, and T. Xiang, "Reidentification by relative distance comparison," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.35, No.3(2013), 653~668. https://doi.org/10.1109/TPAMI.2012.138
  42. Zhou K. and T. Xiang, "Torchreid: A Library for Deep Learning Person Re-Identification in Pytorch," arXiv:1910.10093 [cs.CV], 2019.
  43. Central Support Center for Childcare, "Play Video of a 2019 Revised Nuri Curriculum Pilot Daycare Center (Jeongung Montessori Daycare Center)," Youtube, Available at https://www.youtube.com/watch?v=P8P0ZMP4nZo/ (Accessed 27 Mar, 2020).