An Incremental Clustering Technique of XML Documents using Cluster Histograms

클러스터의 히스토그램을 이용한 XML 문서의 점진적 클러스터링 기법

  • Jeong-Hee Hwang (Department of Computer Science, Namseoul University)
  • Published : 2007.06.15

Abstract

As basic research toward the efficient integration and retrieval of XML documents, this paper proposes a method for clustering XML documents by structure. Rather than measuring the structural similarity between documents, as previous algorithms do, we adapt an algorithm designed for processing large volumes of transaction data to the clustering of XML documents. The method uses cluster histograms, which represent the distribution of items within each cluster, and also takes global cluster cohesion into account. We compare the proposed method with existing techniques experimentally. The experiments show that our method produces good-quality clusters and also reduces processing time.

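In this paper, as basic research toward the efficient retrieval and integration of XML documents, we propose a structure-centered clustering technique for XML documents. Unlike previous work, which forms clusters based on the structural similarity between documents, we modify and apply an algorithm for handling transaction data that can process large amounts of data quickly. Clustering is performed using histograms that represent the cumulative distribution of the items contained in each cluster, so that the overall cohesion of the clustering is taken into account. Experimental comparison with previous work showed improved clustering time and the generation of good-quality clusters.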
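
The abstract does not give the exact cohesion criterion, so the following is only a minimal sketch of histogram-based incremental clustering, not the paper's algorithm. It assumes that each XML document has already been reduced to a set of items (here, hypothetical element paths such as `book/title`) and it borrows the CLOPE-style criterion for transaction data, in which a cluster with N documents, histogram area S, and histogram width W contributes S·N / W^r to the global score. The names `Cluster`, `cluster_incrementally`, and the repulsion parameter `r` are illustrative assumptions.

```python
from collections import Counter


class Cluster:
    """A cluster summarized only by its item histogram (assumed representation)."""

    def __init__(self):
        self.hist = Counter()  # item -> number of documents containing it
        self.n = 0             # number of documents in the cluster

    @property
    def size(self):
        # Total number of item occurrences, i.e. the area of the histogram.
        return sum(self.hist.values())

    @property
    def width(self):
        # Number of distinct items in the cluster.
        return len(self.hist)

    def gain(self, items, r):
        """Change in this cluster's cohesion term if `items` were added."""
        new_width = self.width + sum(1 for i in items if i not in self.hist)
        if new_width == 0:  # empty document offered to an empty cluster
            return 0.0
        new_score = (self.size + len(items)) * (self.n + 1) / new_width ** r
        old_score = self.size * self.n / self.width ** r if self.n else 0.0
        return new_score - old_score

    def add(self, items):
        self.hist.update(items)
        self.n += 1


def cluster_incrementally(docs, r=2.0):
    """Assign each document, in arrival order, to the cluster with the best gain.

    A fresh empty cluster is always one of the candidates, so a new cluster
    is opened whenever that yields a strictly higher gain than any existing one.
    """
    clusters, labels = [], []
    for items in docs:
        items = set(items)
        candidates = clusters + [Cluster()]
        best = max(range(len(candidates)),
                   key=lambda k: candidates[k].gain(items, r))
        if best == len(clusters):
            clusters.append(candidates[best])
        clusters[best].add(items)
        labels.append(best)
    return labels, clusters
```

Because each decision needs only the histograms of the current clusters, a new document can be placed without comparing it against every earlier document, which is the property that makes this family of methods attractive for large collections. A larger `r` penalizes wide histograms more strongly and therefore tends to produce more, tighter clusters.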
