http://dx.doi.org/10.3745/KIPSTD.2007.14-D.2.169

An Unsupervised Clustering Technique of XML Documents based on Function Transform and FFT  

Lee, Ho-Suk (Department of New Media, College of Engineering, Hoseo University)
Abstract
This paper presents a new unsupervised technique for clustering XML documents, based on a function transform and the FFT (Fast Fourier Transform). An XML document is first transformed into a discrete function that reflects the hierarchical nesting structure of its elements. The discrete function is then transformed into a vector using the FFT. The vectors of two documents are compared using a weighted Euclidean distance metric. If the distance is below a prespecified threshold, the two documents are considered structurally similar and are grouped into the same cluster. XML clustering can be useful for storing and searching XML documents. Experiments were conducted with 800 synthetic documents and with 520 real documents. The results showed that the function transform and FFT are effective for incremental, unsupervised clustering of structurally similar XML documents.
Keywords
Unsupervised Clustering; Structure of Elements; Function Transform; FFT; Weighted Euclidean Distance
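
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's actual function transform, weights, or threshold, so everything concrete here is an assumption made for illustration: the discrete function is taken to be the pre-order sequence of element nesting depths, the vector is the magnitude of the first FFT coefficients of that sequence padded to 16 samples, the weights are 1/(1+k), and a document is compared only with the first member of each existing cluster. The names `depth_signal`, `structure_vector`, `weighted_distance`, and `cluster` are hypothetical.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def depth_signal(xml_text):
    """Assumed function transform: pre-order list of nesting depths (root = 1)."""
    signal = []

    def walk(node, depth):
        signal.append(float(depth))
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 1)
    return signal


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def structure_vector(xml_text, size=16):
    """Zero-pad or truncate the depth signal to `size`, return FFT magnitudes."""
    signal = depth_signal(xml_text)[:size]
    signal += [0.0] * (size - len(signal))
    return [abs(c) for c in fft(signal)[: size // 2 + 1]]


def weighted_distance(a, b):
    """Weighted Euclidean distance; the weights 1/(1+k) are an assumed choice."""
    return math.sqrt(
        sum((x - y) ** 2 / (1 + k) for k, (x, y) in enumerate(zip(a, b)))
    )


def cluster(documents, threshold=1.0):
    """Incremental clustering: join the first cluster whose representative
    (its first member) is closer than `threshold`, else start a new cluster."""
    clusters = []  # list of (representative_vector, [document indices])
    for index, doc in enumerate(documents):
        vector = structure_vector(doc)
        for representative, members in clusters:
            if weighted_distance(vector, representative) < threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


docs = [
    "<a><b><c/></b><b><c/></b></a>",
    "<x><y><z/></y><y><z/></y></x>",      # same nesting, different tag names
    "<r><s/><s/><s/><s/><s/><s/></r>",    # flat structure
]
print(cluster(docs))  # [[0, 1], [2]]
```

Because this encoding ignores tag names and text, the first two documents map to the same vector and fall into one cluster, while the flat third document starts a new one. Processing documents one at a time against existing clusters is what makes the scheme incremental; the threshold of 1.0 is arbitrary and would need tuning on real data.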