http://dx.doi.org/10.5391/JKIIS.2011.21.2.171

A New Statistical Sampling Method for Reducing Computing Time of Machine Learning Algorithms

Jun, Sung-Hae (Department of Bioinformatics and Statistics, Cheongju University)
Publication Information
Journal of the Korean Institute of Intelligent Systems, v.21, no.2, 2011, pp. 171-177
Abstract
Accuracy and computing time are important issues in machine learning. In general, the computing time of a data analysis increases in proportion to the size of the given data, so a sampling approach is needed to reduce the size of the training data. However, the accuracy of the constructed model tends to decrease as the data size is reduced. To solve this problem, we propose a new statistical sampling method whose performance is similar to that obtained with the total data. We suggest a rule for selecting the optimal sampling technique according to the structure of the given data. The method reduces computing time while keeping most of the accuracy, using cluster sampling, stratified sampling, and systematic sampling. We verify the performance of the proposed method by comparing accuracy and computing time between the sample data and the total data on objective machine learning data sets.
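The three sampling schemes named in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation, and the abstract does not state the selection rule, so none is attempted here. The function names, the `(id, label, group)` row layout, and the sample sizes are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def systematic_sample(rows, k, rng):
    """Keep every k-th row after a random start in [0, k)."""
    start = rng.randrange(k)
    return rows[start::k]


def stratified_sample(rows, key, fraction, rng):
    """Draw the same fraction from each stratum (here, each class label)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


def cluster_sample(rows, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every row inside them."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[cluster_of(row)].append(row)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [row for c in chosen for row in clusters[c]]


# Hypothetical data: 1000 rows of (id, class label, group id).
rng = random.Random(0)
rows = [(i, i % 4, i // 100) for i in range(1000)]

sys_rows = systematic_sample(rows, k=10, rng=rng)
str_rows = stratified_sample(rows, key=lambda r: r[1], fraction=0.1, rng=rng)
clu_rows = cluster_sample(rows, cluster_of=lambda r: r[2], n_clusters=2, rng=rng)

print(len(sys_rows), len(str_rows), len(clu_rows))
```

A model would then be trained on one of these reduced sets instead of on `rows`, and its accuracy and training time compared with a model trained on the full data, which is the comparison the abstract describes.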
Keywords
Machine learning algorithm; statistical sampling; computing time; cluster sampling; stratified sampling; systematic sampling
Citations & Related Records
Times cited by KSCI: 2