XML Document Clustering Based on Sequential Pattern

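  • Jeong-Hee Hwang (Department of Computer Science, Graduate School, Chungbuk National University);
  • Keun-Ho Ryu (School of Electrical and Computer Engineering, Chungbuk National University)
  • Published: 2003.12.01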


As Internet use grows, the volume of information is increasing exponentially. Because XML, the standard for web data, represents data flexibly, web-based electronic document systems such as EDMS (Electronic Document Management System) and ebXML (Electronic Business using eXtensible Markup Language) have been adopting XML as their document exchange method and standard document format. Research on managing and searching this spreading body of XML documents efficiently is therefore needed. In this paper, we target XML documents whose elements carry ordering semantics and propose a method that groups documents by structural similarity: in a pre-clustering step, sequential pattern mining extracts from each document a representative structure that reflects its characteristics, and documents with similar structures are then clustered on the basis of the extracted structures. The proposed algorithm uses a cost computation that considers both cluster cohesion and inter-cluster similarity, which improves clustering accuracy.
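The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the algorithm, so everything here is an assumption made for illustration: the function names (`tag_paths`, `frequent_sequences`, `partition_cost`, `cluster`), the `min_support` and `weight` parameters, the use of root-to-leaf tag paths as input sequences, Jaccard similarity over mined patterns, and the greedy assignment. The cost used below (dissimilarity inside clusters plus weighted similarity across clusters) is a stand-in for the paper's cost computation, not a reproduction of it.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def tag_paths(xml_text):
    """Return the root-to-leaf tag sequences of a document, in document order."""
    paths = []

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.append(path)
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def frequent_sequences(sequences, min_support):
    """Small PrefixSpan-style miner for sequences of single items.

    Returns {pattern: support}; support is the number of input sequences
    containing the pattern as an order-preserving (possibly gapped) subsequence.
    """
    results = {}

    def grow(prefix, projected):
        # `projected` holds, per matching sequence, the suffix after the prefix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            grow(pattern, [s[s.index(item) + 1:] for s in projected if item in s])

    grow((), [tuple(s) for s in sequences])
    return results


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def partition_cost(clusters, features, weight=1.0):
    """Assumed cost: low cohesion inside clusters plus similarity across clusters."""
    intra = sum(1.0 - jaccard(features[i], features[j])
                for members in clusters
                for i, j in combinations(members, 2))
    inter = sum(jaccard(features[i], features[j])
                for c1, c2 in combinations(clusters, 2)
                for i in c1 for j in c2)
    return intra + weight * inter


def cluster(features, weight=1.0):
    """Greedily place each document where the total partition cost is lowest."""
    clusters = []
    for doc in range(len(features)):
        candidates = [
            [m + [doc] if idx == k else m for idx, m in enumerate(clusters)]
            for k in range(len(clusters))
        ]
        candidates.append(clusters + [[doc]])  # option: open a new cluster
        clusters = min(candidates, key=lambda c: partition_cost(c, features, weight))
    return clusters


docs = [
    "<book><title/><author><name/></author>"
    "<chapter><title/></chapter><chapter><title/></chapter></book>",
    "<book><title/><isbn/><author><name/></author><chapter><title/></chapter>"
    "<chapter><title/></chapter><chapter><title/></chapter></book>",
    "<order><customer><name/></customer>"
    "<item><price/></item><item><price/></item></order>",
    "<order><customer><name/></customer><note/>"
    "<item><price/></item><item><price/></item></order>",
]
# Each document's "representative structure" is taken to be its set of
# frequent tag patterns (an assumption of this sketch).
features = [set(frequent_sequences(tag_paths(d), min_support=2)) for d in docs]
print(cluster(features))  # [[0, 1], [2, 3]]
```

In this toy input the two book-like documents and the two order-like documents end up in separate clusters, because elements that occur only once (such as `isbn` or `note`) fall below `min_support` and do not affect the mined structure. Raising `weight` penalizes similar documents that sit in different clusters more heavily, so it favors fewer, larger clusters.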


