[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/KIPSTD.2007.14-D.7.733

An Adaptive Grid-based Clustering Algorithm over Multi-dimensional Data Streams

Park, Nam-Hun (Department of Computer Science, Graduate School, Yonsei University)
Lee, Won-Suk (Department of Computer Science, Yonsei University)

Publication Information

The KIPS Transactions: Part D, v.14D, no.7, 2007, pp. 733-742

Abstract

A data stream is a massive, unbounded sequence of data elements generated continuously at a rapid rate. For this reason, the memory used for data stream analysis must remain bounded even though new data elements keep arriving. To satisfy this requirement, data stream processing sacrifices some accuracy in its analysis results by allowing a degree of error. Old distribution statistics are diminished by a predefined decay rate as time passes, so that the effect of obsolete information on the current clustering result is eliminated without physically maintaining any data element. This paper proposes a grid-based clustering algorithm for a data stream. Given a set of initial grid cells, the dense range of a grid cell is recursively partitioned into smaller cells in a top-down manner, based on the distribution statistics of its data elements, until the smallest cell, called a unit cell, is identified. Because the dynamically partitioned grid cells maintain only the distribution statistics of data elements, the clusters of a data stream can be found effectively without physically maintaining the data elements. Furthermore, the memory usage of the proposed algorithm adapts to the size of the confined memory space by flexibly resizing the unit cell. As a result, the confined memory space can be fully utilized to produce clustering results that are as accurate as possible. The proposed algorithm is analyzed through a series of experiments to identify its various characteristics.
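
The abstract's two central ideas, decayed distribution statistics and top-down partitioning of dense cells down to a unit cell, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the class names `Cell` and `DecayedGrid`, the parameters `decay`, `split_support`, and `unit_size`, the rule of halving a cell along its widest dimension, and the even division of a parent's count between its children are all assumptions made for illustration. The adaptive resizing of the unit cell under a memory limit is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    lo: tuple            # lower corner (inclusive)
    hi: tuple            # upper corner (exclusive)
    count: float = 0.0   # decayed number of elements seen in this cell
    last: int = 0        # time step at which count was last decayed
    children: list = field(default_factory=list)

    def contains(self, x):
        return all(l <= v < h for l, v, h in zip(self.lo, x, self.hi))


class DecayedGrid:
    """Hypothetical sketch: only per-cell statistics are stored, never the elements."""

    def __init__(self, lo, hi, decay=0.99, split_support=0.2, unit_size=1.0):
        self.root = Cell(tuple(lo), tuple(hi))
        self.decay = decay                  # assumed per-element decay rate
        self.split_support = split_support  # assumed density threshold for splitting
        self.unit_size = unit_size          # assumed edge length of a unit cell
        self.total = 0.0                    # decayed total number of elements
        self.t = 0

    def insert(self, x):
        # Assumes x lies inside the root cell: lo <= x < hi in every dimension.
        self.t += 1
        self.total = self.total * self.decay + 1.0
        cell = self.root
        while True:
            # Apply the decay lazily, only to cells on the path of this element.
            cell.count *= self.decay ** (self.t - cell.last)
            cell.last = self.t
            cell.count += 1.0
            if not cell.children:
                break
            cell = next(c for c in cell.children if c.contains(x))
        self._maybe_split(cell)

    def _maybe_split(self, cell):
        widths = [h - l for l, h in zip(cell.lo, cell.hi)]
        dim = max(range(len(widths)), key=widths.__getitem__)
        if widths[dim] <= self.unit_size:
            return  # already a unit cell
        if cell.count / self.total < self.split_support:
            return  # not dense enough to partition
        mid = (cell.lo[dim] + cell.hi[dim]) / 2
        left_hi = cell.hi[:dim] + (mid,) + cell.hi[dim + 1:]
        right_lo = cell.lo[:dim] + (mid,) + cell.lo[dim + 1:]
        half = cell.count / 2  # assumption: elements spread evenly inside the cell
        cell.children = [
            Cell(cell.lo, left_hi, half, self.t),
            Cell(right_lo, cell.hi, half, self.t),
        ]

    def dense_unit_cells(self, min_support):
        """Return (lo, hi) of unit cells whose decayed support is at least min_support."""
        if self.total == 0:
            return []
        out, stack = [], [self.root]
        while stack:
            c = stack.pop()
            if c.children:
                stack.extend(c.children)
                continue
            is_unit = max(h - l for l, h in zip(c.lo, c.hi)) <= self.unit_size
            current = c.count * self.decay ** (self.t - c.last)
            if is_unit and current / self.total >= min_support:
                out.append((c.lo, c.hi))
        return out


# Usage: a stream concentrated at one point drives the partitioning to a unit cell.
grid = DecayedGrid(lo=(0, 0), hi=(8, 8), unit_size=1.0)
for _ in range(200):
    grid.insert((2.5, 2.5))
dense = grid.dense_unit_cells(min_support=0.5)
print(dense)
```

In this sketch a cell that stops receiving elements is never merged back or removed; its decayed count simply shrinks toward zero, so it drops out of the dense unit cells without any stored element being revisited.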

Keywords

Data Stream; Data Mining; Clustering

