Efficient Linear Path Query Processing using Information Retrieval Techniques for Large-Scale Heterogeneous XML Documents  

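Young-Ho Park (Department of Computer Science, Korea Advanced Institute of Science and Technology)
Wook-Shin Han (Department of Computer Engineering, Kyungpook National University)
Kyu-Young Whang (Department of Computer Science, Korea Advanced Institute of Science and Technology)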
Abstract
We propose XIR-Linear, a novel method for processing partial match queries on large-scale heterogeneous XML documents using information retrieval (IR) techniques. XPath queries are written as path expressions over the tree structure that represents an XML document, and the major form of an XPath query is a partial match query. The objective of XIR-Linear is to support this type of query efficiently for large collections of documents with heterogeneous schemas. XIR-Linear builds on schema-level methods that use relational tables and drastically improves their efficiency and scalability by means of an inverted index. The method indexes the labels in label paths in the same way that keywords are indexed in text, which allows the label paths matching a query to be found far more efficiently than with the string matching used in conventional methods. We demonstrate the efficiency and scalability of XIR-Linear by comparing it with XRel and XParent on XML documents crawled from the Internet. The results show that, for linear path expressions, XIR-Linear is more efficient than both XRel and XParent by several orders of magnitude as the number of XML documents increases.
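The idea of treating labels in label paths as keywords can be illustrated with a small sketch. The abstract does not describe the actual index layout or matching algorithm of XIR-Linear, so everything below is an assumption made for illustration: the function names (`build_label_index`, `parse_query`, `match_label_paths`), the posting format `(path_id, position)`, and the sample label paths are all hypothetical. The sketch only shows how an inverted index over labels could answer a partial match query such as `//book/title` without string-matching every stored label path.

```python
from collections import defaultdict

def build_label_index(label_paths):
    """Index each label as a 'keyword' with postings (path_id, position).

    Hypothetical layout for illustration, not the one used by XIR-Linear.
    """
    index = defaultdict(list)
    path_lengths = []
    for path_id, path in enumerate(label_paths):
        labels = path.strip("/").split("/")
        path_lengths.append(len(labels))
        for pos, label in enumerate(labels):
            index[label].append((path_id, pos))
    return index, path_lengths

def parse_query(query):
    """Split a linear path expression into (axis, label) steps."""
    steps, i = [], 0
    while i < len(query):
        if query.startswith("//", i):
            axis, i = "//", i + 2
        elif query.startswith("/", i):
            axis, i = "/", i + 1
        else:
            raise ValueError("query must start each step with / or //")
        j = query.find("/", i)
        if j == -1:
            j = len(query)
        if j == i:
            raise ValueError("empty label in query")
        steps.append((axis, query[i:j]))
        i = j
    return steps

def match_label_paths(index, path_lengths, query):
    """Return ids of label paths that match the query and end at its last label."""
    current = None  # path_id -> positions matched by the most recent step
    for axis, label in parse_query(query):
        matched = defaultdict(set)
        for path_id, pos in index.get(label, ()):
            if current is None:
                # First step: '/' anchors at the root, '//' may start anywhere.
                ok = pos == 0 if axis == "/" else True
            else:
                prev = current.get(path_id)
                if not prev:
                    continue
                # '/' needs the previous label directly before; '//' anywhere before.
                ok = (pos - 1) in prev if axis == "/" else min(prev) < pos
            if ok:
                matched[path_id].add(pos)
        current = matched
    return sorted(pid for pid, positions in (current or {}).items()
                  if path_lengths[pid] - 1 in positions)

# Sample label paths (made up for this example).
paths = ["/bib/book/title", "/bib/book/author/name",
         "/bib/article/title", "/catalog/item/title"]
index, lengths = build_label_index(paths)
print(match_label_paths(index, lengths, "//book/title"))  # [0]
print(match_label_paths(index, lengths, "/bib//title"))   # [0, 2]
```

In this sketch only the postings of the labels named in the query are read, so the work depends on how often those labels occur rather than on the total number of stored label paths. That is the general intuition behind using an inverted index here, not a statement about the measured behavior of XIR-Linear.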
Keywords
XML; partial match queries; inverted indexes; Information Retrieval; XIR-Linear; IR