Efficient Linear Path Query Processing using Information Retrieval Techniques for Large-Scale Heterogeneous XML Documents

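  • Young-Ho Park (Department of Computer Science, Korea Advanced Institute of Science and Technology) ;
  • Wook-Shin Han (Department of Computer Engineering, Kyungpook National University) ;
  • Kyu-Young Whang (Department of Computer Science, Korea Advanced Institute of Science and Technology)
  • Published : 2004.10.01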

Abstract

We propose XIR-Linear, a novel method for efficiently processing partial match queries on large-scale heterogeneous XML documents using information retrieval (IR) techniques. XPath queries are written as path expressions over the tree structure that represents an XML document, and a major form of XPath query is the partial match query. The objective of XIR-Linear is to support this type of query efficiently for large collections of documents with heterogeneous schemas. XIR-Linear builds on the schema-level methods that use relational tables and drastically improves their efficiency and scalability by using an inverted index technique. The method treats each label path as a text and the labels in that path as keywords in the text, and indexes the labels using IR techniques. This allows the label paths that match a query to be found far more efficiently than with the string matching used in conventional methods. We demonstrate the efficiency and scalability of XIR-Linear by comparing it with XRel and XParent, two existing methods that use relational tables, on XML documents crawled from the Internet. The results show that, within the range of our experiments, XIR-Linear is more efficient than both XRel and XParent by several orders of magnitude for linear path expressions, and that its advantage grows as the number of XML documents increases.
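
The core idea of the abstract, treating each label path as a text whose keywords are its labels, can be illustrated with a minimal Python sketch. This is not the authors' implementation: XIR-Linear builds on relational tables, whereas the sketch uses in-memory dictionaries, and every name in it (`build_index`, `parse_query`, `matches`, `find_label_paths`, the sample label paths) is hypothetical. It assumes a linear path query made only of `/` and `//` steps over plain labels, and it assumes that a label path matches only if the last label of the query is the last label of the path.

```python
import re


def build_index(label_paths):
    """Build an inverted index: label -> {path_id: [positions of the label]}."""
    index = {}
    for path_id, path in enumerate(label_paths):
        for pos, label in enumerate(path.strip("/").split("/")):
            index.setdefault(label, {}).setdefault(path_id, []).append(pos)
    return index


def parse_query(query):
    """Split a query such as '/a//b' into [('/', 'a'), ('//', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", query)


def matches(steps, index, path_id, path_length):
    """Check the position constraints of the query against one label path."""
    reachable = None  # positions where the query prefix seen so far can end
    for axis, label in steps:
        positions = index[label][path_id]
        if reachable is None:
            # A leading '/' anchors the first label at the root (position 0).
            found = [p for p in positions if axis == "//" or p == 0]
        elif axis == "/":
            # Child step: the label must directly follow the previous one.
            found = [p for p in positions if p - 1 in reachable]
        else:
            # Descendant step: the label may appear anywhere later.
            first = min(reachable)
            found = [p for p in positions if p > first]
        if not found:
            return False
        reachable = set(found)
    # Assumption: the query's last label must be the path's last label.
    return path_length - 1 in reachable


def find_label_paths(query, label_paths, index):
    """Return the label paths that match a linear partial match query."""
    steps = parse_query(query)
    labels = [label for _, label in steps]
    if not steps or any(label not in index for label in labels):
        return []
    # Intersect posting lists first, so only candidate paths are checked.
    candidates = set.intersection(*(set(index[label]) for label in labels))
    result = []
    for path_id in sorted(candidates):
        length = len(label_paths[path_id].strip("/").split("/"))
        if matches(steps, index, path_id, length):
            result.append(label_paths[path_id])
    return result


# Hypothetical label paths drawn from documents with different schemas.
LABEL_PATHS = [
    "/library/book/title",
    "/library/book/chapter/title",
    "/store/item/title",
    "/library/journal/article/author",
    "/book/title/note",
]
INDEX = build_index(LABEL_PATHS)

print(find_label_paths("//book//title", LABEL_PATHS, INDEX))
```

In this sketch, the posting-list intersection discards every label path that lacks one of the query's labels before any position check runs. A string-matching approach would instead have to compare the query pattern against every stored path.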

References

  1. A. Aboulnaga, A. R. Alameldeen, and J. Naughton, 'Estimating the Selectivity of XML Path Expressions for Internet Scale Applications,' In Proc. the 27th Int'l Conf. on Very Large Data Bases (VLDB), pp. 591-600, Rome, Italy, Sept. 11-14, 2001
  2. S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu, 'Structural Joins: A Primitive for Efficient XML Query Pattern Matching,' In Proc. 18th Int'l Conf. on Data Engineering, San Jose, California, pp. 141-152, Feb. 2002 https://doi.org/10.1109/ICDE.2002.994704
  3. Jan-Marco Bremer and Michael Gertz, 'XQuery/IR: Integrating XML Document and Data Retrieval,' In Proc. the Fifth Int'l Workshop on the Web and Databases (WebDB 2002), pp. 1-6, Madison, Wisconsin, 2002
  4. N. Bruno, N. Koudas, and D. Srivastava, 'Holistic Twig Joins: Optimal XML Pattern Matching,' In Proc. 2002 ACM SIGMOD Int'l Conf. on Management of Data, pp. 310-321, Madison, Wisconsin, June 3-6, 2002 https://doi.org/10.1145/564691.564727
  5. J. Clark and S. DeRose, XML Path Language (XPath), W3C Recommendation, http://www.w3.org/TR/xpath, Nov. 1999
  6. C. Chung, J. Min, and K. Shim, 'APEX: An Adaptive Path Index for XML Data,' In Proc. 2002 ACM SIGMOD Int'l Conf. on Management of Data, pp. 121-132, Madison, Wisconsin, June 3-6, 2002 https://doi.org/10.1145/564691.564706
  7. B. F. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon, 'A Fast Index for Semistructured Data,' In Proc. the 27th Int'l Conf. on Very Large Data Bases (VLDB), pp. 341-350, Rome, Italy, Sept. 11-14, 2001
  8. Daniela Florescu, Donald Kossmann, and Ioana Manolescu, 'Integrating Keyword Search into XML Query Processing,' In Proc. the 9th WWW Conference/Computer Networks, pp. 119-135, Amsterdam, NL, May 2000 https://doi.org/10.1016/S1389-1286(00)00069-4
  9. M. F. Fernandez and D. Suciu, 'Optimizing Regular Path Expressions using Graph Schemas,' In Proc. the 14th Int'l Conf. on Data Engineering (ICDE), pp. 14-23, Orlando, Florida, USA, Feb. 23-27, 1998
  10. L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, 'XRANK: Ranked Keyword Search over XML Documents,' In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp. 16-27, 2003 https://doi.org/10.1145/872757.872762
  11. R. Goldman and J. Widom, 'DataGuides: Enabling Query Formulation and Optimization in Semi-structured Databases,' In Proc. the 23rd Int'l Conf. on Very Large Data Bases (VLDB), pp. 436-445, Athens, Greece, Aug. 26-29, 1997
  12. H. Jiang, H. Lu, W. Wang, and J. Xu Yu, 'Path Materialization Revisited: An Efficient Storage Model for XML Data,' In Proc. the 13th Australasian Database Conference (ADC), pp. 85-94, Melbourne, Australia, Jan. 28 - Feb. 1, 2002
  13. H. Jiang, H. Lu, W. Wang and J. Yu, 'XParent: An Efficient RDBMS-Based XML Database System,' ICDE 2002 https://doi.org/10.1109/ICDE.2002.994745
  14. H. Jiang, H. Lu, W. Wang, and B. C. Ooi, 'XR-Tree: Indexing XML Data for Efficient Structural Joins,' In Proc. the 19th Int'l Conf. on Data Engineering (ICDE), pp. 253-264, Bangalore, India, Mar. 5-8, 2003
  15. H. Jiang, W. Wang, H. Lu, and J. X. Yu, 'Holistic Twig Joins on Indexed XML Documents,' In Proc. the 29th Int'l Conf. on Very Large Data Bases (VLDB), pp. 273-284, Berlin, Germany, Sept. 9-12, 2003
  16. R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth, 'Covering Indexes for Branching Path Queries,' In Proc. 2002 ACM SIGMOD Int'l Conf. on Management of Data, pp. 133-144, Madison, Wisconsin, June 3-6, 2002
  17. Q. Li and B. Moon, 'Indexing and Querying XML Data for Regular Path Expressions,' In Proc. the 27th Int'l Conf. on Very Large Data Bases (VLDB), pp. 361-370, Rome, Italy, Sept. 11-14, 2001
  18. F. Mandreoli, R. Martoglia, P. Tiberio, 'Searching Similar (Sub)Sentences for Example-Based Machine Translation,' In Proc. SEBD'02, Isola d'Elba, Italy, June 2002
  19. J. Naughton et al., 'The Niagara Internet Query System,' IEEE Data Engineering Bulletin, Vol. 24, No. 2, pp. 27-33, June 2001
  20. C. Petrou, S. Hadjiefthymiades, and D. Martakos, 'An XML-based, 3-tier Scheme for Integrating Heterogeneous Information Sources to the WWW,' In Proc. the 10th Int'l Workshop on Database and Expert Systems Applications, pp. 706-710, Florence, Italy, Sept. 1-3, 1999 https://doi.org/10.1109/DEXA.1999.795270
  21. N. Polyzotis and M. Garofalakis, 'Statistical Synopses for Graph-structured XML Databases,' In Proc. 2002 ACM SIGMOD Int'l Conf. on Management of Data, pp. 358-369, Madison, Wisconsin, June 3-6, 2002 https://doi.org/10.1145/564691.564733
  22. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York 1983
  23. M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura, 'XRel: A Path-Based Approach to Storage and Retrieval of XML Documents using Relational Databases,' ACM Transactions on Internet Technology, Vol. 1, Aug. 2001 https://doi.org/10.1145/383034.383038
  24. Chun Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman, 'On Supporting Containment Queries in Relational Database Management Systems,' SIGMOD, 2001 https://doi.org/10.1145/376284.375722
  25. Odysseus Object-Relational Database Management System, http://odysseus.kaist.ac.kr/
  26. ReGet Deluxe 3.3 Beta (build 173), http://deluxe.reget.com/en/
  27. Teleport Pro Version 1.29, http://www.tenmax.com/teleport/pro/home.htm
  28. XMark-An XML Benchmark Project, http://monetdb.cwi.nl/xml
  29. Xyleme, http://www.xyleme.com
  30. eXtensible Markup Language(XML), http://www.w3.org/XML/