A Hierarchical Clustering Technique of XML Documents based on Representative Path

  • Published : 2009.06.30

Abstract

XML is increasingly important in data exchange and information management. Considerable effort has been spent on developing efficient techniques for accessing, querying, and storing XML documents. In this paper, we propose a new method to cluster XML documents efficiently. As the feature of an XML document, we propose a new representative path, called a virtual path, which can represent both the structure and the contents of the document. We also propose a method for applying well-known hierarchical clustering techniques to the representative paths in order to cluster XML documents. The experiment shows that the true clusters are formed in a compact shape when a virtual path is used as the feature of an XML document.
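To make the general idea concrete, the following is a minimal sketch of hierarchical (agglomerative) clustering applied to path-based features of XML documents. It is not the method of the paper: the abstract does not define how a virtual path is constructed, so the sketch substitutes the set of root-to-leaf tag paths as a stand-in feature, ignores document contents, and assumes Jaccard distance with single linkage. The names `tag_paths`, `jaccard_distance`, and `agglomerative`, as well as the sample documents, are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document.

    Assumption: this is only a stand-in for the paper's virtual path.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if not children:
            paths.add(path)  # leaf element: record the full path
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


def jaccard_distance(a, b):
    """Return 0.0 for identical sets and 1.0 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative(features, k):
    """Single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for x, y in combinations(range(len(clusters)), 2):
            d = min(jaccard_distance(features[i], features[j])
                    for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x (x < y)
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical sample documents: two book-like and two order-like.
docs = [
    "<book><title>A</title><author>X</author></book>",
    "<book><title>B</title><author>Y</author><year>2009</year></book>",
    "<order><item><sku>1</sku></item><total>5</total></order>",
    "<order><item><sku>2</sku></item></order>",
]
features = [tag_paths(d) for d in docs]
clusters = agglomerative(features, k=2)
print(clusters)  # → [[0, 1], [2, 3]]
```

Single linkage is used here only because it is the simplest linkage to write out. Complete or average linkage would change just the `min` in the inner distance computation.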
