http://dx.doi.org/10.9717/kmms.2020.23.8.986

AnoVid: A Deep Neural Network-based Tool for Video Annotation  

Hwang, Jisu (Dept. of Computer Science, Kyonggi University)
Kim, Incheol (Dept. of Computer Science, Kyonggi University)
Abstract
In this paper, we propose AnoVid, an automated video annotation tool based on deep neural networks that generates various metadata for each scene or shot in a long drama video containing rich elements. To this end, a novel metadata schema for drama video is designed. Based on this schema, AnoVid employs a total of six deep neural network models for object detection, place recognition, time zone recognition, person recognition, activity detection, and description generation. Using these models, AnoVid can generate rich video annotation data. In addition to automatically producing a JSON-format video annotation file, AnoVid provides various visualization facilities for checking the video content analysis results. Through experiments with a real drama video, "Misaeng", we demonstrate the practical effectiveness and performance of the proposed video annotation tool, AnoVid.
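For illustration, the sketch below shows what one per-shot record in such a JSON annotation file might look like, with one field per neural network model. This is a minimal sketch only: the field names (shot_id, place, time_zone, persons, objects, activities, description) and values are assumptions for illustration, not the actual schema defined in the paper.

    import json

    # Hypothetical per-shot annotation record; the field names are
    # illustrative assumptions, not the paper's actual metadata schema.
    annotation = {
        "video": "Misaeng_ep01.mp4",
        "shot_id": 42,
        "start_frame": 10320,
        "end_frame": 10511,
        "place": "office",                        # place recognition model
        "time_zone": "daytime",                   # time zone recognition model
        "persons": ["Jang Geu-rae"],              # person recognition model
        "objects": ["desk", "monitor", "chair"],  # object detection model
        "activities": ["talking"],                # activity detection model
        "description": "A man is talking at a desk in an office."
                                                  # description generation model
    }

    # Write the annotations to a JSON file, one record per shot.
    with open("Misaeng_ep01_annotation.json", "w", encoding="utf-8") as f:
        json.dump([annotation], f, ensure_ascii=False, indent=2)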
Keywords
Video Annotation; Video Metadata; Deep Neural Network