Similarity checking between XML tags through expanding synonym vector

Lee, Jung-Won; Lee, Hye-Soo; Lee, Ki-Ho

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 29, Issue 9 / Pages 676-683 / 2002 / 1229-6848 (pISSN)

Korean Institute of Information Scientists and Engineers (한국정보과학회)

유사어 벡터 확장을 통한 XML태그의 유사성 검사

Lee, Jung-Won (Dept. of Computer Science, Ewha Womans University) ;
Lee, Hye-Soo (Dept. of Computer Science, Ewha Womans University) ;
Lee, Ki-Ho (Dept. of Computer Science, Ewha Womans University)

이정원 (이화여자대학교 컴퓨터학과) ;
이혜수 (이화여자대학교 컴퓨터학과) ;
이기호 (이화여자대학교 컴퓨터학과)

Published : 2002.10.01

Abstract

The success of XML (eXtensible Markup Language) rests primarily on its flexibility: anyone can define the structure of XML documents to represent information in whatever form they wish. That same flexibility means XML documents cannot automatically be given an underlying semantics. Different tag sets, different names for elements or attributes, and different document structures in general make it hard to classify and cluster XML documents precisely. In this paper, we design and implement a system that checks the semantic similarity between XML tags. First, the system extracts the underlying semantics of tags and then expands the synonym set of each tag using the WordNet thesaurus and a user-defined word library that supports the abbreviated forms and compound words found in XML tags. Second, to account for the relative importance of XML tags within XML documents, we extend the conventional vector space model, the most widely used document model in the field of Information Retrieval. Using this method, we were able to check the similarity between XML tags even when they are expressed with different tag names.
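The first step described in the abstract, expanding a tag into a synonym set, can be sketched roughly as follows. This is only an illustration under stated assumptions: the small `SYNONYMS` table stands in for WordNet, the `ABBREVIATIONS` table stands in for the user-defined word library, and the function names `split_tag` and `expand_tag` are hypothetical rather than taken from the paper.

```python
import re

# Hypothetical stand-ins: a tiny synonym table in place of WordNet and a small
# user-defined abbreviation library. Neither reflects the paper's actual data.
SYNONYMS = {
    "author": {"writer", "creator"},
    "name": {"title", "label"},
    "price": {"cost"},
}
ABBREVIATIONS = {"auth": "author", "nm": "name", "qty": "quantity"}


def split_tag(tag):
    """Split a compound tag such as 'authorName' or 'auth_nm' into lowercase words."""
    words = []
    # First split on common separators, then on camelCase boundaries.
    for part in re.split(r"[_\-.:\s]+", tag):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def expand_tag(tag):
    """Return the set of words and synonyms that stand for one tag."""
    expanded = set()
    for word in split_tag(tag):
        word = ABBREVIATIONS.get(word, word)  # resolve abbreviations first
        expanded.add(word)
        expanded |= SYNONYMS.get(word, set())  # add synonyms when known
    return expanded


if __name__ == "__main__":
    a = expand_tag("auth_nm")
    b = expand_tag("writerTitle")
    print(sorted(a))
    print(sorted(a & b))  # words shared by the two differently named tags
```

With this toy data, the tags `auth_nm` and `writerTitle` share no literal word, yet their expanded sets overlap, which is the effect the synonym expansion is meant to achieve.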

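The main reason XML (eXtensible Markup Language) documents have been able to establish themselves as the standard for Web documents is the flexibility that lets users describe their own document types. The problem this flexibility causes, however, is that each XML document author uses different tag names and structures to express the same meaning. That is, because of different tag sets, different names for elements and attributes, or different document structures, documents expressed with different tags are easily regarded as belonging to different classes. This paper therefore designs and implements a concept-based Tag Pattern Matcher that extracts the semantic information and structural information inherent in XML tags, expands them into synonyms that are semantically as close as possible, and compares and analyzes the semantic similarity between the expanded tags of XML documents. By weighting the semantic similarity between the tags of two XML documents, and thereby extending the vector space model conventionally used for unstructured (semi-structured) documents, it is possible to determine whether the two XML documents are similar.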
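The weighted comparison of two documents can be illustrated with a minimal sketch of a vector space model over expanded tags. The abstract does not give the weighting scheme or the similarity formula, so everything here is an assumption: tag weights are supplied by hand, each tag spreads its weight over its expanded synonym set, and plain cosine similarity compares the resulting vectors. The names `build_vector` and `cosine` are hypothetical.

```python
import math
from collections import defaultdict


def build_vector(tag_weights, expansions):
    """Spread each tag's weight over its expanded terms (assumed scheme)."""
    vector = defaultdict(float)
    for tag, weight in tag_weights.items():
        # A tag without an expansion entry contributes only itself.
        for term in expansions.get(tag, {tag}):
            vector[term] += weight
    return dict(vector)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical expanded synonym sets and hand-picked tag weights.
    expansions = {
        "author": {"author", "writer"},
        "writer": {"writer", "author"},
        "price": {"price", "cost"},
    }
    doc_a = {"author": 2.0, "price": 1.0}
    doc_b = {"writer": 2.0}

    print(cosine(doc_a, doc_b))  # raw tag names: no overlap
    print(cosine(build_vector(doc_a, expansions),
                 build_vector(doc_b, expansions)))  # after expansion
```

On raw tag names the two documents share nothing, so the first similarity is zero; after expansion the `author` and `writer` tags land on the same terms and the similarity becomes positive.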

