http://dx.doi.org/10.7465/jkdi.2013.24.5.959

Big data and statistics  

Kim, Yongdai (Department of Statistics, Seoul National University)
Cho, Kwang Hyun (National Institute of Animal Science)
Publication Information
Journal of the Korean Data and Information Science Society, v.24, no.5, 2013, pp. 959-974
Abstract
We investigate the roles of statistics and statisticians in the big data era. The definition and application areas of big data are reviewed, and the statistical characteristics of big data and their implications are discussed. Various statistical methodologies applicable to big data analysis are illustrated, and two real big data projects are described.
Keywords
Big data; big data analysis; big data case investigation