A Path Partitioning Technique for Indexing XML Data

XML 데이타 색인을 위한 경로 분할 기법

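  • Jongik Kim (School of Computer Science and Engineering, Seoul National University)
  • Hyoung-Joo Kim (School of Computer Science and Engineering, Seoul National University)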
  • Published : 2004.06.01

Abstract

XML query languages express queries as paths in a data graph; paths are the basic building block of an XML query. Users can write more expressive queries by applying patterns (e.g., regular expressions) to paths. Because XML data is semi-structured, a data graph contains many identical paths. Existing XML indexing techniques exploit these identical paths, but the resulting index can grow larger than the source data graph and cannot guarantee an efficient access path. In this paper, we propose a technique that partitions all the paths in a data graph, and we develop an index graph that efficiently finds the appropriate partitions for a path query. The size of our index graph can be adjusted regardless of the source data, so the cost of index graph traversal can be reduced significantly. In a performance study, we show that our index is much faster than other graph-based indexes.

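Query languages for XML express queries using paths in a data graph. In particular, using patterns (for example, regular expressions) in paths makes it possible to write queries without knowing the exact structure of the data. Such pattern-based queries, however, greatly widen the portion of the data graph that must be searched. To narrow this search range, existing XML indexing techniques group identical paths in the data graph together to build a smaller index graph. In many cases, however, the index grows to the size of the data graph, so it fails to narrow the search range and cannot guarantee efficient query processing. In this paper, we propose an index graph that partitions all the paths present in the data and, at query processing time, quickly locates the partitions that match a query. The size of the proposed index graph can be adjusted regardless of the size of the data graph. By keeping the index graph small, the cost of traversing it can therefore be reduced greatly. Through experiments, we show that the proposed technique is more efficient than existing graph-based indexing techniques, and we examine how performance changes as the index size varies.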
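
The abstract does not spell out the partitioning scheme, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' index. It assumes a tree-shaped toy document, groups nodes that share an identical label path, and places each path into one of a fixed number of partitions chosen by hashing the path's final label. The names `TREE`, `label_paths`, `build_partitions`, and `query` are hypothetical, as is the choice of CRC32 as the hash.

```python
import re
import zlib
from collections import defaultdict

# Hypothetical toy document as nested tuples: (node_id, label, children).
TREE = ("n0", "movies", [
    ("n1", "movie", [
        ("n2", "title", []),
        ("n3", "actor", [("n4", "name", [])]),
    ]),
    ("n5", "movie", [
        ("n6", "title", []),
        ("n7", "director", [("n8", "name", [])]),
    ]),
])


def label_paths(node, prefix=()):
    """Yield (label_path, node_id) for every node, starting at the root."""
    node_id, label, children = node
    path = prefix + (label,)
    yield path, node_id
    for child in children:
        yield from label_paths(child, path)


def bucket_of(label, num_partitions):
    """Assumed partitioning rule: hash the final label of a path."""
    return zlib.crc32(label.encode("utf-8")) % num_partitions


def build_partitions(tree, num_partitions):
    """Group nodes by identical label path, then assign paths to partitions."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for path, node_id in label_paths(tree):
        bucket = bucket_of(path[-1], num_partitions)
        partitions[bucket]["/".join(path)].append(node_id)
    return partitions


def query(partitions, pattern, last_label):
    """Match a regular path expression, scanning only one partition."""
    regex = re.compile(pattern)
    part = partitions[bucket_of(last_label, len(partitions))]
    hits = []
    for path, node_ids in part.items():
        if regex.fullmatch(path):
            hits.extend(node_ids)
    return sorted(hits)


if __name__ == "__main__":
    parts = build_partitions(TREE, num_partitions=4)
    print(query(parts, r"movies/.*/name", "name"))       # ['n4', 'n8']
    print(query(parts, r"movies/movie/title", "title"))  # ['n2', 'n6']
```

In this sketch the number of partitions is a free parameter, which mirrors the claim that the index size can be chosen independently of the data: changing `num_partitions` changes how many paths each partition holds, but not the query results.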
