
An Incremental Method Using Sample Split Points for Global Discretization  

Kyoung-Sik Han (Inwoo Technology Co., Ltd.)
Soowon Lee (School of Computing, Soongsil University)
Abstract
Most supervised learning algorithms require continuous variables to be transformed into categorical ones at the preprocessing stage, in order to avoid the difficulty of processing continuous values directly. This preprocessing step, called global discretization, uses a class distribution list called a bin. However, when the data are large and the range of the variable to be discretized is very wide, extensive sorting and merging must be performed, because most global discretization methods require all values to be collected into a single bin. Moreover, because the existing methods operate in batch mode, whenever new data are added they must repeat the discretization from scratch to construct categories that reflect the added data. This paper proposes a method that extracts sample points and performs discretization from these sample points in order to solve these problems. Because the proposed approach does not require merging to produce a single bin, it is efficient when large datasets must be discretized. Experiments on real and synthetic datasets compare the proposed method with an existing one.
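A minimal Python sketch of the general idea follows, assuming reservoir sampling as the source of sample points and equal-frequency cuts as the split criterion. The class name SampleSplitDiscretizer, its parameters, and the equal-frequency rule are illustrative stand-ins, not the authors' actual algorithm, which selects split points from class distribution lists (bins).

```python
import bisect
import random

class SampleSplitDiscretizer:
    """Hypothetical sketch of sample-point-based discretization.

    Split points are drawn from a bounded reservoir sample of the data,
    so no global sort/merge into a single bin is required, and new data
    can be absorbed incrementally. Equal-frequency cuts stand in for the
    paper's class-distribution (bin) based criterion.
    """

    def __init__(self, num_bins=5, sample_size=200, seed=0):
        self.num_bins = num_bins
        self.sample_size = sample_size
        self.sample = []            # reservoir sample of observed values
        self.seen = 0               # total number of values observed
        self.rng = random.Random(seed)

    def update(self, values):
        """Absorb new data incrementally via reservoir sampling."""
        for v in values:
            self.seen += 1
            if len(self.sample) < self.sample_size:
                self.sample.append(v)
            else:
                j = self.rng.randrange(self.seen)
                if j < self.sample_size:
                    self.sample[j] = v

    def split_points(self):
        """Equal-frequency split points computed over the sample only."""
        s = sorted(self.sample)
        return [s[(i * len(s)) // self.num_bins]
                for i in range(1, self.num_bins)]

    def transform(self, value):
        """Map a continuous value to a categorical bin index."""
        return bisect.bisect_right(self.split_points(), value)

# Usage: discretize an initial batch, then refine with newly added data.
rng = random.Random(1)
disc = SampleSplitDiscretizer(num_bins=5, sample_size=200)
disc.update(rng.gauss(0.0, 1.0) for _ in range(10000))
old_cuts = disc.split_points()
disc.update(rng.gauss(3.0, 1.0) for _ in range(10000))  # new data arrive
print(old_cuts)
print(disc.split_points())  # cuts shift without re-discretizing from scratch
print(disc.transform(0.3))  # bin index for a single value
```

Because the reservoir sample is bounded, each new value is absorbed in constant time and the split points never need to be rebuilt over the full dataset, mirroring the incremental property the abstract claims for the sample-point approach.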
Keywords
Global Discretization; Machine Learning; Incremental Learning; Large Dataset; Data Mining