http://dx.doi.org/10.3837/tiis.2019.03.033

CNN-based Visual/Auditory Feature Fusion Method with Frame Selection for Classifying Video Events  

Choe, Giseok (Department of Computer Science and Engineering, Sogang University)
Lee, Seungbin (Department of Computer Science and Engineering, Sogang University)
Nang, Jongho (Department of Computer Science and Engineering, Sogang University)
Publication Information
KSII Transactions on Internet and Information Systems (TIIS) / v.13, no.3, 2019, pp. 1689-1701
Abstract
In recent years, personal videos have been widely shared online owing to the popularity of portable devices such as smartphones and action cameras. A recent report predicted that video content will account for 80% of Internet traffic by 2021. Several studies have addressed the detection of main video events in order to manage large collections of videos, and they achieve fairly good performance in certain genres. However, these methods have difficulty detecting events in personal videos, because the characteristics and genres of personal videos vary widely. In our study, we found that adding a dataset that matches the perspective of personal videos improves performance, and that performance also depends on how keyframes are extracted from a video. We therefore select frame segments that can represent a video, taking the characteristics of personal videos into account. From each frame segment, object, location, food, and audio features are extracted, and a representative vector is generated through a CNN-based recurrent model and a fusion module. The proposed method achieved a mean average precision (mAP) of 78.4% in experiments on the LSVC dataset.
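As a rough illustration of the pipeline summarized above, the following PyTorch-style sketch encodes per-segment object, location, food, and audio features with recurrent models and fuses them into a single representative vector for event classification. This is a minimal sketch under stated assumptions, not the authors' implementation: all layer sizes, the GRU encoders, the concatenation-based fusion, and the 500-class output (matching the LSVC label set) are illustrative choices.

```python
# Minimal sketch of visual/auditory feature fusion over selected frame segments.
# Feature dimensions, GRU encoders, and concatenation fusion are assumptions.
import torch
import torch.nn as nn


class SegmentFusionClassifier(nn.Module):
    def __init__(self, obj_dim=2048, place_dim=2048, food_dim=2048,
                 audio_dim=128, hidden_dim=512, num_classes=500):
        super().__init__()
        # One recurrent encoder per modality summarizes the segment sequence.
        self.visual_rnn = nn.GRU(obj_dim + place_dim + food_dim,
                                 hidden_dim, batch_first=True)
        self.audio_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # Fusion module: concatenate modality vectors, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, obj_feat, place_feat, food_feat, audio_feat):
        # obj/place/food_feat: (batch, num_segments, *_dim) from pretrained CNNs
        # audio_feat:          (batch, num_segments, audio_dim) from an audio CNN
        visual = torch.cat([obj_feat, place_feat, food_feat], dim=-1)
        _, v_state = self.visual_rnn(visual)       # (1, batch, hidden_dim)
        _, a_state = self.audio_rnn(audio_feat)
        fused = torch.cat([v_state[-1], a_state[-1]], dim=-1)
        return self.classifier(fused)              # per-class event scores


# Example: random features for 8 selected segments of one video.
model = SegmentFusionClassifier()
scores = model(torch.randn(1, 8, 2048), torch.randn(1, 8, 2048),
               torch.randn(1, 8, 2048), torch.randn(1, 8, 128))
print(scores.shape)  # torch.Size([1, 500])
```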
Keywords
Multimedia; Computer Vision Systems; Artificial Intelligence; Video Classification
Citations & Related Records
  • Reference
1 W. Zhu, C. Toklu, and S.-P. Liou, "Automatic News Video Segmentation and Categorization Based on Closed-captioned Text," in Proc. of the IEEE International Conference on Multimedia and Expo, 2001.
2 Z. Liu, Y. Wang, and T. Chen, "Audio Feature Extraction and Analysis for Scene Segmentation and Classification," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 20, no. 1-2, pp. 61-79, 1998.
3 B. T. Truong, and C. Dorai, "Automatic Genre Identification for Content-based Video Categorization," in Proc. of the International Conference on Pattern Recognition, pp. 230-233, 2000.
4 H. Wang, and C. Schmid, "Action Recognition with Improved Trajectories," in Proc. of IEEE International Conference on Computer Vision, pp. 3551-3558, 2013.
5 A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
6 J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625-2634, 2015.
7 L. Wang, Y. Qiao, and X. Tang, "Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4305-4314, 2015.
8 D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning Spatiotemporal Features with 3d Convolutional Networks," in Proc. of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.
9 K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild," arXiv preprint arXiv:1212.0402, 2012.
10 B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 Million Image Database for Scene Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
11 L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101: Mining Discriminative Components with Random Forests," in Proc. of the European Conference on Computer Vision, pp. 446-461, 2014.
12 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," in Proc. of International Conference on Computer Vision and Pattern Recognition, 2015.
13 S. Ioffe, and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv preprint arXiv:1502.03167, 2015.
14 K. Simonyan, and A. Zisserman, "Very Deep Convolutional Networks for Large-scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
15 R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN Architecture for Weakly Supervised Place Recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297-5307, 2016.
16 Cisco, "Cisco Visual Networking Index: Forecast and Methodology," Feb. 15, 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.html
17 Z. Wu, Y. G. Jiang, L. S. Davis, and S.-F. Chang, "LSVC2017: Large-Scale Video Classification Challenge," 2017.
18 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-scale Hierarchical Image Database," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
19 K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
20 S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, and B. Seybold, "CNN Architectures for Large-scale Audio Classification," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 131-135, 2017.