Classification and Analysis of Data Mining Algorithms

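  • Jung-Won Lee (Department of Computer Science, Ewha Womans University)
  • Ho-Sook Kim (Department of Computer Science, Ewha Womans University)
  • Ji-Young Choi (Department of Computer Science, Ewha Womans University)
  • Hyon-Hee Kim (Department of Computer Science, Ewha Womans University)
  • Hwan-Seung Yong (Department of Computer Science, Ewha Womans University)
  • Sang-Ho Lee (Department of Computer Science, Ewha Womans University)
  • Seung-Soo Park (Department of Computer Science, Ewha Womans University)
  • Published: 2001.09.01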

Abstract

Data mining plays an important role in the knowledge discovery process, and existing algorithms are usually selected according to the specific purpose of the mining task. Data mining techniques are currently being actively applied to statistics, business, electronic commerce, biology, and medicine, and numerous algorithms are being researched and developed for these applications. In the long run, however, only a few algorithms will survive: those that are well suited to specific applications and perform well on large databases. It is therefore reasonable to focus future effort on those selected algorithms. This paper classifies about 30 existing algorithms into 7 categories: association rule, clustering, neural network, decision tree, genetic algorithm, memory-based reasoning, and Bayesian network. We first analyze the systematic hierarchy and characteristics of the algorithms, then present 14 criteria for classifying them, together with the results of applying these criteria. Finally, among algorithms that are comparable but differ in features and performance, we propose the best ones. The results of this paper can be used as a guideline for data mining research as well as for field applications of data mining.

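In the data mining stage, which plays the central role in the knowledge discovery process, algorithms are selected and used according to various purposes. Recently, data mining techniques have been actively applied in fields such as statistics, business, electronic commerce, medicine, and biology, and a variety of algorithms continue to be researched and developed for this purpose. Over time, however, only a few of them will remain: those that show excellent applicability in their respective fields or good performance in handling massive amounts of data. Going forward, research needs to concentrate on these selected algorithms. This paper therefore selects, classifies, and analyzes algorithms that are widely used in data mining and under active research, drawn from seven categories: association rule, clustering, neural network, decision tree, genetic algorithm, Bayesian network, and memory-based reasoning. We first analyzed the lineage and characteristics of each algorithm and, on that basis, proposed 14 classification criteria for comparison and analysis. Using these criteria, we analyzed the individual algorithms and, for those that can be compared, identified the best algorithm in terms of features and performance. By classifying and analyzing the algorithms scattered across the data mining field, the results of this study can offer users a guide for choosing an algorithm when applying mining techniques.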
