An Incremental Method Using Sample Split Points for Global Discretization

  • Published : 2004.07.01

Abstract

Most supervised learning algorithms are applied after continuous variables have been transformed into categorical ones at a preprocessing stage, in order to avoid the difficulty of handling continuous values. This preprocessing stage is called global discretization, and it relies on class-distribution lists called bins. Because most global discretization methods require a single bin, however, a great deal of sorting and merging must be performed to build that bin when the data are large and the range of the variable to be discretized is very wide. Moreover, because the existing methods discretize in batch mode, whenever new data arrive the discretization must be redone from scratch to produce categories that reflect the new data. To address these problems, this paper proposes a method that extracts sample split points and performs discretization from them. Since this approach needs no merging to produce a single bin, it is efficient for discretizing large data. In this study, the proposed method is compared experimentally with an existing method on real and synthetic data sets.
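The general idea of discretizing from sampled split points can be illustrated with a small sketch. The Python code below is a hypothetical illustration only: it draws a random sample of a continuous attribute, uses the midpoints between the sample's sorted distinct values as candidate split points, and then greedily picks the cuts that minimize the weighted class entropy of the resulting intervals. The function names (sample_split_points, discretize), the random-sampling scheme, and the fixed number of intervals used as a stopping rule are assumptions made for this sketch, not the paper's actual procedure.

```python
import numpy as np

def class_entropy(counts):
    """Entropy of a class-count vector (one 'bin' in the abstract's sense)."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

def sample_split_points(values, sample_size, rng=None):
    """Candidate cut points taken from a random sample of the attribute:
    midpoints between the sorted distinct sampled values.
    (Assumed sampling scheme, not the paper's definition.)"""
    rng = np.random.default_rng(rng)
    sample = rng.choice(values, size=min(sample_size, len(values)), replace=False)
    distinct = np.unique(sample)
    return list((distinct[:-1] + distinct[1:]) / 2.0)

def discretize(values, labels, n_intervals, sample_size=1000, rng=None):
    """Greedy top-down discretization: repeatedly add the candidate cut that
    most reduces the weighted class entropy of the intervals."""
    values = np.asarray(values, dtype=float)
    classes, y = np.unique(np.asarray(labels), return_inverse=True)
    candidates = sample_split_points(values, sample_size, rng)
    cuts = []

    def weighted_entropy(cut_list):
        # Partition the attribute by the current cuts and sum per-interval entropies.
        edges = [-np.inf] + sorted(cut_list) + [np.inf]
        h, n = 0.0, len(values)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (values > lo) & (values <= hi)
            counts = np.bincount(y[mask], minlength=len(classes))
            h += counts.sum() / n * class_entropy(counts)
        return h

    while len(cuts) < n_intervals - 1 and candidates:
        best = min(candidates, key=lambda c: weighted_entropy(cuts + [c]))
        cuts.append(best)
        candidates.remove(best)
    return sorted(cuts)
```

In an incremental setting one could keep the per-interval class counts (the bins) and, when new records arrive, update those counts and re-evaluate only the affected candidate split points rather than re-sorting and re-merging the whole data set; this reflects the incremental behaviour described above, again stated as an assumption rather than as the paper's exact algorithm.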

