A Comparative Study on Discretization Algorithms for Data Mining

  • Choi, Byong-Su (Department of Multimedia Engineering, Hansung University) ;
  • Kim, Hyun-Ji (Department of Statistics, Sungkyunkwan University) ;
  • Cha, Woon-Ock (Department of Multimedia Engineering, Hansung University)
  • Received : 2010.10
  • Accepted : 2010.12
  • Published : 2011.01.30

Abstract

Discretization, which converts continuous attributes into discrete ones, is a preprocessing step for data mining tasks such as classification, and some classification algorithms can handle only discrete attributes. Its purpose is to produce discretized data without losing the information in the original data and to achieve high predictive accuracy when the discretized data are used for classification. Many discretization algorithms have been developed. This paper presents a comparative study of recently proposed representative discretization algorithms from the viewpoints of splitting versus merging and supervised versus unsupervised methods. We implemented the discretization algorithms in R and made the code publicly available.

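Discretization is a preprocessing step for data mining that converts continuous variables into discrete ones; its goal is to achieve high classification accuracy while minimizing the loss of information contained in the original data. Many discretization algorithms have been proposed so far. In this paper, we compare representative discretization algorithms proposed to date from the viewpoint of splitting versus merging discretization and study the characteristics of these algorithms. We also implemented the compared algorithms in R so that they can be used in other research.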
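
The distinction the abstract draws between unsupervised and supervised discretization can be illustrated with a minimal sketch. This is not the authors' R implementation: it is written in Python, the function names (`equal_width_cuts`, `best_entropy_cut`, `discretize`) are hypothetical, and the supervised part shows only a single entropy-minimizing split without any stopping criterion. Merging methods such as ChiMerge are not shown.

```python
import math
from bisect import bisect_right
from collections import Counter


def equal_width_cuts(values, k):
    """Unsupervised: return the k-1 interior cut points of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_entropy_cut(values, labels):
    """Supervised splitting step: the cut point that minimizes the
    size-weighted class entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def discretize(values, cuts):
    """Map each continuous value to the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]


# Toy data (made up for illustration): the class changes between 2.0 and 3.0.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
labels = ["a", "a", "b", "b", "b", "b"]

unsupervised = discretize(values, equal_width_cuts(values, 2))
supervised = discretize(values, [best_entropy_cut(values, labels)])
```

On this toy data the equal-width cut ignores the class labels and lumps the first five values into one interval, whereas the entropy-based cut lands between the two classes. That difference is what a supervised-versus-unsupervised comparison measures through downstream classification accuracy.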
