XML Document Analysis based on Similarity

Lee, Jung-Won; Lee, Ki-Ho

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 29, Issue 6 / Pages 367-376 / 2002 / 1229-6848 (pISSN)

Korean Institute of Information Scientists and Engineers (한국정보과학회)

유사성 기반 XML 문서 분석 기법

Lee, Jung-Won (Dept. of Computer Science, Ewha Womans University);
Lee, Ki-Ho (Dept. of Computer Science, Ewha Womans University)

Published : 2002.06.01

Abstract

XML lets users define elements with arbitrary words and organize them in a nested structure. These features present both challenges and opportunities for information retrieval and document management. In this paper, we propose a new methodology for computing similarity between XML documents that takes XML semantics into account, namely the meanings of the elements and the nested structure of the documents. The resulting document similarity can serve as a basis for information retrieval and data mining. We first use a thesaurus to expand each XML element into an extended-element vector that normalizes synonyms, compound words, and abbreviations, and we build a similarity matrix from these vectors to compute the similarity between XML elements. We then discover and minimize the nested structure of each XML document using automata: an NFA (nondeterministic finite automaton) and a DFA (deterministic finite automaton). Finally, we compute the similarity between XML structures from the element similarity matrix and the minimized structures. In an experiment on real documents from an online bookstore, the methodology identified document categories with 100% accuracy.
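
The element-similarity step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy thesaurus, the abbreviation table, the function names (`extend_element`, `element_similarity`, `similarity_matrix`), and the use of Jaccard overlap as the similarity measure are all assumptions made here for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical toy thesaurus: maps a word to a canonical synonym label.
# The paper relies on a real thesaurus; these entries are invented.
THESAURUS = {
    "author": "writer", "writer": "writer", "creator": "writer",
    "price": "cost", "cost": "cost",
    "book": "book", "publication": "book",
}

# Hypothetical abbreviation table, also invented for illustration.
ABBREVIATIONS = {"pub": "publication", "auth": "author"}


def extend_element(tag):
    """Build an 'extended-element vector' (here: a set of normalized words).

    Compound tags are split on camelCase boundaries and on '_', '-', '.',
    abbreviations are expanded, and synonyms are mapped to one label.
    """
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", tag)
    words = re.split(r"[\s_\-.]+", spaced.lower())
    normalized = set()
    for word in words:
        if not word:
            continue
        word = ABBREVIATIONS.get(word, word)
        normalized.add(THESAURUS.get(word, word))
    return frozenset(normalized)


def element_similarity(tag_a, tag_b):
    """Jaccard overlap of two extended-element vectors (assumed measure)."""
    vec_a, vec_b = extend_element(tag_a), extend_element(tag_b)
    if not vec_a and not vec_b:
        return 0.0
    return len(vec_a & vec_b) / len(vec_a | vec_b)


def similarity_matrix(tags_a, tags_b):
    """Rows follow tags_a, columns follow tags_b."""
    return [[element_similarity(a, b) for b in tags_b] for a in tags_a]


if __name__ == "__main__":
    doc_a = ["bookAuthor", "price"]
    doc_b = ["pub_writer", "cost", "isbn"]
    for row in similarity_matrix(doc_a, doc_b):
        print(row)
```

With these invented tables, `bookAuthor` and `pub_writer` both normalize to the same word set and score 1.0, while `price` and `isbn` share nothing and score 0.0. A real system would replace the dictionaries with thesaurus lookups and might weight words rather than treat them as a plain set.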

