Comparison of Binary Discretization Algorithms for Data Mining

  • Na, Jong-Hwa (Dept. of Information and Statistics & Institute for Basic Science Research, Chungbuk National University)
  • Kim, Jeong-Mi (Dept. of Information and Statistics, Chungbuk National University)
  • Cho, Wan-Sup (Dept. of MIS, Chungbuk National University)
  • Published: 2005.11.30

Abstract

Discretization algorithms for continuous data have been actively studied in recent years, but few articles compare the efficiency of these algorithms. In this paper we introduce the principles of several binary discretization algorithms, including C4.5, CART, and QUEST, and investigate their efficiency through a numerical study. For various underlying distributions, we compare these algorithms in terms of misclassification rate and MSE. Real data examples are also included.
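To make the idea of binary discretization concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single continuous attribute with class labels and picks the cut point that minimizes a weighted impurity: the Gini index, in the spirit of CART, or entropy, in the spirit of the information-based criterion used by C4.5. QUEST is not sketched here. The function names (`gini`, `entropy`, `best_binary_split`, `misclassification_rate`) and the choice of midpoints between adjacent distinct values as candidate cuts are assumptions made for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_binary_split(x, y, impurity=gini):
    """Return (cut, score) minimizing the weighted impurity of the two sides.

    Candidate cuts are midpoints between adjacent distinct sorted values
    (an illustrative assumption). Ties keep the smallest cut.
    Returns (None, inf) if all values of x are equal.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point exists between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best_score:
            best_cut, best_score = (lo + hi) / 2, score
    return best_cut, best_score


def misclassification_rate(x, y, cut):
    """Error rate when each side of the cut predicts its majority class."""
    left = [label for v, label in zip(x, y) if v <= cut]
    right = [label for v, label in zip(x, y) if v > cut]
    errors = sum(
        len(side) - Counter(side).most_common(1)[0][1]
        for side in (left, right)
        if side
    )
    return errors / len(y)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    x = [1, 2, 3, 4, 5, 6]
    y = ["a", "a", "b", "a", "b", "b"]
    for name, criterion in (("gini", gini), ("entropy", entropy)):
        cut, score = best_binary_split(x, y, impurity=criterion)
        print(name, cut, score, misclassification_rate(x, y, cut))
```

Passing the impurity function as a parameter keeps the search loop identical across criteria, so only the scoring rule differs between the two variants being compared.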
