Branching Path Query Processing for XML Documents Using the Prefix Match Join

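  • Young-Ho Park (Department of Computer Science, KAIST) ;
  • Wook-Shin Han (Department of Computer Engineering, Kyungpook National University) ;
  • Kyu-Young Whang (Department of Computer Science, KAIST)
  • Published: 2005.08.01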

Abstract

We propose XIR-Branching, a novel method for processing partial match queries on large, heterogeneous collections of XML documents using information retrieval (IR) techniques and a novel instance-level join technique. A partial match query is defined as one whose path expression contains the descendant-or-self axis "//"; such queries are classified into linear path expressions (LPEs) and branching path expressions (BPEs). In its general form, a partial match query has branch predicates that form branching paths. The objective of XIR-Branching is to support this type of query efficiently for large-scale documents with heterogeneous schemas. XIR-Branching builds on the conventional schema-level methods that use relational tables (e.g., XRel, XParent, XIR-Linear[21]) and significantly improves their efficiency and scalability using two techniques: an inverted index and a newly introduced instance-level join, the prefix match join. The former handles LPEs and is the method used in XIR-Linear[21]. The latter handles BPEs and finds the result nodes more efficiently than the containment joins used in the conventional methods. XIR-Linear processes LPEs efficiently by using an inverted index but does not handle BPEs, which are needed to express queries that are both more specific and more general. XIR-Branching therefore extends XIR-Linear to cover BPEs: it first reduces the candidate set for a query substantially with the schema-level method, and then efficiently computes the final result set with the prefix match join as an instance-level method. We compare the efficiency and scalability of XIR-Branching with those of XRel and XParent using XML documents crawled from the Internet. The results show that XIR-Branching is more efficient than both XRel and XParent by several orders of magnitude for linear path expressions, and by several factors for branching path expressions.
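The query classes named in the abstract can be illustrated with a minimal sketch. It is not part of XIR-Branching itself. It assumes XPath-style syntax in which a branch predicate is written in square brackets, and the function names `is_partial_match` and `classify_path_expression` are hypothetical helpers introduced only for this illustration.

```python
def is_partial_match(expr: str) -> bool:
    """Return True if the expression uses the descendant-or-self axis '//'.

    Assumption: '//' never appears inside a string literal in the expression.
    """
    return "//" in expr


def classify_path_expression(expr: str) -> str:
    """Classify an XPath-like expression as 'LPE' (linear) or 'BPE' (branching).

    Assumption: a branch predicate is written in square brackets, as in XPath,
    and any expression without brackets is treated as linear.
    """
    return "BPE" if "[" in expr else "LPE"


if __name__ == "__main__":
    # Hypothetical example queries, not taken from the paper.
    for query in ("//book//title", "//book[.//author]//title", "/lib/book/title"):
        print(query, is_partial_match(query), classify_path_expression(query))
```

Under these assumptions, `//book//title` is a partial match query with a linear path, while `//book[.//author]//title` is a partial match query whose predicate introduces a branch. The check is purely syntactic and is only meant to make the terminology concrete.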

References

  1. eXtensible Markup Language(XML), http://www.w3.org/XML/
  2. J. Naughton et al., 'The Niagara Internet Query System,' IEEE Data Engineering Bulletin, Vol. 24, No. 2, pp. 27-33, June, 2001
  3. Xyleme, http://www.xyleme.com
  4. J. Clark and S. DeRose, XML Path Language (XPath), W3C Recommendation, http://www.w3.org/TR/xpath, Nov. 1999
  5. F. Mandreoli, R. Martoglia, P. Tiberio, 'Searching Similar (Sub)Sentences for Example-Based Machine Translation,' In Proc. SEBD'02, Isola d'Elba, Italy, June 2002
  6. M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura, 'XRel: a path-based approach to storage and retrieval of XML documents using relational databases,' ACM Transactions on Internet Technology, Vol. 1, No. 1, August 2001 https://doi.org/10.1145/383034.383038
  7. H. Jiang, H. Lu, W. Wang, and J. Xu Yu, 'Path Materialization Revisited: An Efficient Storage Model for XML Data,' In Proc. the 13th Australasian Database Conference(ADC), pp. 85-94, Melbourne, Australia, Jan. 28 - Feb. 1, 2002 https://doi.org/10.1145/563932.563916
  8. H. Jiang, H. Lu, W. Wang and J. Yu, 'XParent: An Efficient RDBMS-Based XML Database System,' ICDE 2002 https://doi.org/10.1109/ICDE.2002.994745
  9. Q. Li and B. Moon, 'Indexing and Querying XML Data for Regular Path Expressions,' In Proc. the 27th Int'l Conf. on Very Large Data Bases(VLDB), pp. 361-370, Rome, Italy, Sept. 11-14, 2001
  10. C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman, 'On Supporting Containment Queries in Relational Database Management Systems,' In Proc. 2001 ACM SIGMOD Int'l Conf. on Management of Data, pp. 425-436, Santa Barbara, California, May 21-24, 2001 https://doi.org/10.1145/375663.375722
  11. Al-Khalifa, S., Jagadish, H. V., Koudas, N., Patel, J. M., Srivastava, D., and Wu, Y., 'Structural Joins: A Primitive for Efficient XML Query Pattern Matching,' In Proc. 18th Int'l Conf. on Data Engineering, San Jose, California, pp. 141-152, Feb. 2002 https://doi.org/10.1109/ICDE.2002.994704
  12. N. Bruno, N. Koudas, and D. Srivastava, 'Holistic Twig Joins: Optimal XML Pattern Matching,' In Proc. 2002 ACM SIGMOD Int'l Conf. on Management of Data, pp. 310-321, Madison, Wisconsin, June 3-6, 2002
  13. Jiang, H., Lu, H., Wang, W., and Ooi, B. C., 'XR-Tree: Indexing XML Data for Efficient Structural Joins,' In IEEE International Conference on Data Engineering, 2003 https://doi.org/10.1109/ICDE.2003.1260797
  14. H. Jiang, W. Wang, H. Lu, and J. X. Yu, 'Holistic Twig Joins on Indexed XML Documents,' In Proc. the 29th Int'l Conf. on Very Large Data Bases(VLDB), pp. 273-284, Berlin, Germany, Sept. 9-12, 2003
  15. Jan-Marco Bremer and Michael Gertz, 'XQuery/IR: Integrating XML Document and Data Retrieval,' In Proc. the Fifth Int'l Workshop on the Web and Databases(WebDB 2002), pp. 1-6, Madison, Wisconsin, 2002
  16. Daniela Florescu, Donald Kossmann, and Ioana Manolescu, 'Integrating Keyword Search into XML Query Processing,' In Proc. the 9th WWW Conference/Computer Networks, pp. 119-135, Amsterdam, NL, May 2000 https://doi.org/10.1016/S1389-1286(00)00069-4
  17. Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram, 'XRANK: Ranked Keyword Search over XML Documents,' In Proc. 2003 ACM SIGMOD Int'l Conf. on Management of Data, pp. 16-27, San Diego, California, June 9-12, 2003
  18. A. Halverson, J. Burger, L. Galanis, A. Kini, R. Krishnamurthy, A. N. Rao, F. Tian, S. Viglas, Y. Wang, J. F. Naughton, and D. J. DeWitt, 'Mixed Mode XML Query Processing,' In Proc. the 29th Int'l Conf. on Very Large Data Bases(VLDB), pp. 225-236, Berlin, Germany, Sept. 9-12, 2003
  19. B. F. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon, 'A Fast Index for Semistructured Data,' In Proc. the 27th Int'l Conf. on Very Large Data Bases(VLDB), pp. 341-350, Rome, Italy, Sept. 11-14, 2001
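  20. Young-Ho Park, Wook-Shin Han, and Kyu-Young Whang, 'Efficient Linear Path Query Processing for Large-Scale Heterogeneous XML Documents Using Information Retrieval Techniques,' Journal of KISS: Databases, Vol. 31, No. 5, Oct. 2004 (in Korean)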
  21. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York 1983
  22. Al-Khalifa, S., Jagadish, H. V., Koudas, N., Patel, J. M., Srivastava, D., and Wu, Y., 'Structural Joins: A Primitive for Efficient XML Query Pattern Matching,' In Proc. 18th Int'l Conf. on Data Engineering, San Jose, California, pp. 141-152, Feb. 2002 https://doi.org/10.1109/ICDE.2002.994704
  23. A. Aboulnaga, A. R. Alameldeen, and J. Naughton, 'Estimating the Selectivity of XML Path Expressions for Internet Scale Applications,' In Proc. the 27th Int'l Conf. on Very Large Data Bases (VLDB), pp. 591-600, Rome, Italy, Sept. 11-14, 2001
  24. N. Polyzotis and M. Garofalakis, 'Statistical Synopses for Graph-structured XML Databases,' In Proc. 2002 ACM SIGMOD Int'l Conf. on Management of Data, pp. 358-369, Madison, Wisconsin, June 3-6, 2002 https://doi.org/10.1145/564691.564733
  25. R. Kaushik, P. Bohannon, J. F. Naughton, H. F. Korth, 'Covering Indexes for Branching Path Queries,' Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 133-144, 2002 https://doi.org/10.1145/564691.564707
  26. M. Altinel, M. J. Franklin, 'Efficient Filtering of XML Documents for Selective Dissemination of Information,' In Proc. the 26th Int'l Conf. on Very Large Data Bases(VLDB), pp. 53-64, Cairo, Egypt, Sept. 10-14, 2000
  27. Z. Ives, A. Levy, and D. Weld, Efficient Evaluation of Regular Path Expressions on Streaming XML Data, Technical Report UW-CSE-2000-05-02, University of Washington, 2000
  28. J. McHugh, J. Widom, 'Query Optimization for XML,' In Proc. the 25th Int'l Conf. on Very Large Data Bases(VLDB), pp. 315-326, Edinburgh, Scotland, UK, Sept. 7-10, 1999
  29. Igor Tatarinov, et al., 'Storing and querying ordered XML using a relational database system,' Proc. of ACM SIGMOD, pp. 204-215, 2002 https://doi.org/10.1145/564691.564715
  30. R. Goldman and J. Widom, 'DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases,' In Proc. the 23rd Int'l Conf. on Very Large Data Bases(VLDB), pp. 436-445, Athens, Greece, Aug. 26-29, 1997
  31. Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, and Wook-Shin Han, 'Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features,' Technical Report CS-TR-2004-204, Department of Computer Science, http://cs.kaist.ac.kr/research/technical/Archive/CS-TR-2004-204.pdf, KAIST, Dec., 2004
  32. Teleport Pro Version 1.29, http://www.tenmax.com/teleport/pro/home.htm
  33. ReGet Deluxe 3.3 Beta(build 173), http://deluxe.reget.com/en/
  34. Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, and Wook-Shin Han, 'Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features,' In Proc. the 21st Int'l Conf. on Data Engineering (ICDE), National Center of Sciences, Tokyo, Japan, April 5-8, 2005 https://doi.org/10.1109/ICDE.2005.95