[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/KIPSTD.2007.14-D.2.169

An Unsupervised Clustering Technique of XML Documents based on Function Transform and FFT

Lee, Ho-Suk (Department of New Media, College of Engineering, Hoseo University)

Publication Information

The KIPS Transactions:PartD / v.14D, no.2, 2007, pp. 169-180

Abstract

This paper presents a new unsupervised XML document clustering technique based on a function transform and the FFT (Fast Fourier Transform). An XML document is first transformed into a discrete function based on the hierarchical nesting structure of its elements. The discrete function is then transformed into vectors using the FFT. The vectors of two documents are compared using a weighted Euclidean distance metric. If the distance is below a prespecified threshold, the two documents are considered structurally similar and are grouped into the same cluster. XML clustering can be useful for storing and searching XML documents. Experiments were conducted with 800 synthetic documents and with 520 real documents. The results showed that the function transform and FFT are effective for the incremental, unsupervised clustering of structurally similar XML documents.
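The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: the discrete function is taken to be the nesting depth of each element in document order, the feature vector is the FFT magnitude spectrum of that sequence zero-padded (or truncated) to a fixed length, the weights are uniform, and a new document joins the first cluster whose representative lies within the threshold. The names `depth_sequence`, `spectrum`, `weighted_distance`, and `cluster` are hypothetical.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def depth_sequence(xml_text):
    """Assumed encoding: nesting depth of each element, in pre-order."""
    root = ET.fromstring(xml_text)
    depths = []

    def walk(elem, depth):
        depths.append(float(depth))
        for child in elem:
            walk(child, depth + 1)

    walk(root, 1)
    return depths


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrum(xml_text, size=16):
    """Zero-pad or truncate the depth sequence to `size` (a power of two)
    and return the FFT magnitudes as the document's feature vector."""
    seq = depth_sequence(xml_text)[:size]
    seq += [0.0] * (size - len(seq))
    return [abs(c) for c in fft(seq)]


def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


def cluster(docs, threshold, size=16):
    """Incremental clustering: a document joins the first cluster whose
    representative (its first member) is closer than `threshold`;
    otherwise it starts a new cluster."""
    weights = [1.0] * size  # uniform weights are a placeholder assumption
    clusters = []  # list of (representative vector, member indices)
    for i, doc in enumerate(docs):
        vec = spectrum(doc, size)
        for rep, members in clusters:
            if weighted_distance(vec, rep, weights) < threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


docs = [
    "<a><b><c/></b><d/></a>",
    "<x><y><z/></y><w/></x>",            # same nesting as the first document
    "<r><s/><t/><u/><v/><w/><q/></r>",   # flat structure
]
print(cluster(docs, threshold=1.0))
```

With these three documents the first two share the depth sequence 1, 2, 3, 2 and fall into one cluster, while the flat document starts its own, so the call prints `[[0, 1], [2]]`. Element names are ignored in this sketch, so only the nesting structure affects the result.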

Keywords

Unsupervised Clustering; Structure of Elements; Function Transform; FFT; Weighted Euclidean Distance

Citations & Related Records

1	XML Documents Clustering Technique Based on Bit Vector / Kim, Woo-Saeng / Journal of the Institute of Electronics Engineers of Korea CI
2	XML Document Clustering Technique by K-means algorithm through PCA / Kim, Woo-Saeng / The KIPS Transactions:PartD
3	Clustering Technique Using a Node and Level of XML tree / Kim, Woosaeng / Journal of the Korea Institute of Information and Communication Engineering
