
A Clustering Technique using Common Structures of XML Documents  

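Hwang, Jeong-Hee (Department of Computer Science, Chungbuk National University)
Ryu, Keun-Ho (School of Electrical and Computer Engineering, Chungbuk National University)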
Abstract
As the Internet grows, the use of XML, a standard for semi-structured documents, is increasing, and work on the integration and retrieval of XML documents is ongoing. Efficient integration and retrieval, however, depend on clustering XML documents that have similar structures. Conventional XML clustering approaches use hierarchical clustering algorithms, which produce the required number of clusters through repeated merging. These approaches have two problems: the similarity between XML documents is difficult to compute, and comparing similarity repeatedly is time-consuming. To address these problems, we use a clustering algorithm for transactional data that scales to large data sets. In this paper, we use the common structures of XML documents that have no DTD or schema. To obtain these common structures, we extract representative structures by decomposing the tree model that expresses each XML document, and we perform clustering on the extracted structures. We also show the efficiency of the proposed method by comparing it with the previous method.
Keywords
Document Clustering; XML Clustering; XML Document; Structural Similarity
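The abstract describes decomposing each document's tree model into structures that can be handled like the items of a transaction. The sketch below illustrates that idea only in broad terms and is not the authors' procedure. It assumes the decomposition yields root-to-leaf tag paths, and it uses Jaccard overlap as a stand-in similarity measure, since the abstract does not specify either the decomposition or the clustering criterion. The names `tag_paths` and `jaccard` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document.

    Assumption: each path plays the role of one 'item', so a document
    becomes a transaction (a set of items). The paper's actual
    decomposition may differ.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            # A leaf element closes one structural path.
            paths.add(path)
        for child in children:
            stack.append((child, path + "/" + child.tag))
    return paths


def jaccard(a, b):
    """Share of common paths between two path sets (stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical schemaless documents.
doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><author><name/><email/></author></book>"
doc_c = "<order><item><price/></item></order>"

items_a, items_b, items_c = (tag_paths(d) for d in (doc_a, doc_b, doc_c))

print(sorted(items_a))
print(jaccard(items_a, items_b))  # the two book documents share most paths
print(jaccard(items_a, items_c))  # book and order share no paths
```

Once documents are reduced to sets of path items in this way, a clustering algorithm designed for transactional data can group them by shared items without pairwise tree comparison, which is the motivation the abstract gives for avoiding repeated similarity computation.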