A Cell-based Clustering Method for Large High-dimensional Data in Data Mining

  • Jin, Du-Seok (Dept. of Computer Engineering, Chonbuk National University)
  • Chang, Jae-Woo (Dept. of Computer Engineering, Chonbuk National University)
  • Published: 2001.12.01

Abstract

Data mining applications increasingly require large, high-dimensional data sets. Most existing data mining algorithms, however, are inefficient on such data because of the so-called curse of dimensionality [1] and the limits of available memory. To overcome these problems, this paper proposes a cell-based clustering method. The proposed method provides a cell construction algorithm for handling large high-dimensional data efficiently and a storage index structure based on filtering. We compare the proposed method with the CLIQUE method in terms of clustering time, precision, and retrieval time. The experimental results show that the proposed cell-based clustering method outperforms the CLIQUE method.
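The general idea behind cell-based (grid-based) clustering can be sketched as follows. This is only a minimal illustration under stated assumptions, not the cell construction algorithm or the filtering-based index structure proposed in the paper, whose details the abstract does not give. The function names `assign_cells` and `cluster_dense_cells` and the parameters `cell_width` and `min_points` are hypothetical. The sketch assumes equal-width cells in every dimension, treats a cell as dense when it holds at least `min_points` points, and joins dense cells that share a face into one cluster. Only non-empty cells are stored, which keeps memory use proportional to the data rather than to the full grid.

```python
from collections import defaultdict, deque


def assign_cells(points, cell_width):
    """Map each point index to the integer coordinates of its grid cell.

    Only non-empty cells are kept, so memory grows with the number of
    occupied cells rather than with the size of the full grid.
    """
    cells = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(x // cell_width) for x in point)
        cells[key].append(idx)
    return cells


def cluster_dense_cells(cells, min_points):
    """Label dense cells so that face-adjacent dense cells share a label."""
    dense = {key for key, members in cells.items() if len(members) >= min_points}
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        # Breadth-first search over dense cells that differ by 1 in one dimension.
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for d in range(len(cell)):
                for step in (-1, 1):
                    neighbor = cell[:d] + (cell[d] + step,) + cell[d + 1:]
                    if neighbor in dense and neighbor not in labels:
                        labels[neighbor] = next_label
                        queue.append(neighbor)
        next_label += 1
    return labels
```

Calling `assign_cells(points, 1.0)` on a list of coordinate tuples and passing the result to `cluster_dense_cells(cells, 2)` yields a dictionary from dense cell coordinates to cluster labels; points in cells that are not dense are left unlabeled and can be treated as noise.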

References

  1. Berchtold S., Bohm C., Keim D. and Kriegel H.-P., 'A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space,' ACM PODS Symposium on Principles of Database Systems, Tucson, Arizona, 1997, pp.78-86 https://doi.org/10.1145/263661.263671
  2. Han J. and Kamber M., 'Data Mining: Concepts and Techniques,' Morgan Kaufmann, 2000
  3. Ng R.T. and Han J., 'Efficient and Effective Clustering Methods for Spatial Data Mining,' Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp.144-155
  4. Kaufman L. and Rousseeuw P.J., 'Finding Groups in Data: An Introduction to Cluster Analysis,' John Wiley & Sons, 1990
  5. Zhang T., Ramakrishnan R. and Livny M., 'BIRCH: An Efficient Data Clustering Method for Very Large Databases,' Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp.103-114 https://doi.org/10.1145/233269.233324
  6. Ester M., Kriegel H.-P., Sander J. and Xu X., 'A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,' Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp.226-231
  7. Ester M., Kriegel H.-P., Sander J. and Xu X., 'Density-Connected Sets and Their Application for Trend Detection in Spatial Databases,' Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, 1997, pp.10-15
  8. Wang W., Yang J. and Muntz R., 'STING: A Statistical Information Grid Approach to Spatial Data Mining,' Proc. 23rd Int. Conf. on Very Large Data Bases, 1997, pp.186-195
  9. Agrawal R., Gehrke J., Gunopulos D. and Raghavan P., 'Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,' Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp.94-105 https://doi.org/10.1145/276304.276314
  10. Breiman L., Friedman J. H., Olshen R. A. and Stone C. J., 'Classification and Regression Trees,' Wadsworth, Belmont, 1984
  11. http://WWW.almaden.ibm.com/cs/quest