Extracting Maximal Similar Paths between Two XML Documents using Sequential Pattern Mining  

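Jung-Won Lee (Department of Computer Science, Ewha Womans University)
Seung-Soo Park (Department of Computer Science, Ewha Womans University)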
Abstract
Current research on XML-related techniques mainly concerns storing XML documents, optimizing queries, and indexing. We may also focus on sets of documents that have various structures and do not share a common structure such as the same DTD or XML Schema. In this case, it is essential to analyze the structural similarities and differences among the documents. For example, when documents from the Web or an EDMS (Electronic Document Management System) must be merged or classified, finding their common structure is very important for handling them. In this paper, we adapt sequential pattern mining algorithms to extract maximal similar paths between two XML documents. Experiments with XML documents show that the adapted algorithms find exactly the common structures and the maximal similar paths between them. Analysis of the experimental results shows that similarity metrics based on maximal similar paths can exactly classify the types of XML documents.
Keywords
XML; mining; structure discovery; similarity; sequential patterns
References
1 Jtidy, http://jtidy.sourceforge.net
2 I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier, 'Clone Detection Using Abstract Syntax Trees,' In Proc. of ICSM '98, Nov. 1998
3 S. Jang, S. Seo, and K. Yi, 'Program Similarity Checker,' In Proc. of the 28th Korea Information Science Society Fall Conference, pages 334-336, 2001 (in Korean)
4 R. Agrawal, R. Srikant, 'Fast Algorithms for Mining Association Rules,' In Proc. of the 20th Int'l Conference on Very Large Databases, 1994
5 A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1986
6 J. W. Lee, K. Lee, and W. Kim, 'Preparations for Semantics-based XML Mining,' In Proc. of IEEE International Conference on Data Mining (ICDM '01), pages 345-352, Nov./Dec. 2001
7 C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 1998
8 Y. Papakonstantinou, XML and the Automation of Web Information Processing, Tutorial given at the International Conference on Data Engineering, 1999
9 C. M. Hoffmann and M. J. O'Donnell, 'Pattern Matching in Trees,' Journal of the ACM 29(1), pages 68-95, Jan. 1982
10 P. Kilpeläinen and H. Mannila, 'The Tree Inclusion Problem,' In Proc. of the International Joint Conference on the Theory and Practice of Software Development (TAPSOFT '91), Vol. 1: Colloquium on Trees in Algebra and Programming (CAAP '91), pages 202-214, 1991
11 K. Wang and H. Liu, 'Discovering Typical Structures of Documents: a Road Map Approach,' In Proc. of SIGIR, pages 146-154, 1998
12 R. Srikant and R. Agrawal, 'Mining Sequential Patterns: Generalizations and Performance Improvements,' In Proc. of the Fifth Int'l Conf. on Extending Database Technology (EDBT), March 1996
13 J. Shanmugasundaram, Bridging Relational Technology and XML, Ph.D. Dissertation, University of Wisconsin-Madison, 2001