Spatio-temporal Semantic Features for Human Action Recognition

  • Liu, Jia (Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University)
  • Wang, Xiaonian (Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University)
  • Li, Tianyu (Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University)
  • Yang, Jie (Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University)
  • Received : 2012.07.04
  • Accepted : 2012.09.19
  • Published : 2012.10.31

Abstract

Most approaches to human action recognition are limited, either because they rely on simple action datasets captured in controlled environments or because they focus on excessively localized features without sufficiently exploring spatio-temporal information. This paper proposes a framework for recognizing realistic human actions. Specifically, a new action representation is proposed, based on computing a rich set of descriptors from keypoint trajectories. To obtain efficient and compact representations of actions, we develop a feature fusion method that combines spatio-temporal local motion descriptors according to the movement of the camera, which is detected from the distribution of spatio-temporal interest points in the clips. A new topic model, called the Markov Semantic Model, is proposed for semantic feature selection; it relies on the different kinds of dependencies between words produced by "syntactic" and "semantic" constraints. Informative features are selected collaboratively based on these dependencies, which arise from short-range and long-range constraints. Building on nonlinear SVMs, we validate the proposed hierarchical framework on several realistic action datasets.
