A Cyclic Sliced Partitioning Method for Packing High-dimensional Data

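  • Tae-Wan Kim (Research Institute of Computer, Information and Communication, Pusan National University) ;
  • Ki-Joune Li (Department of Computer Science, Pusan National University)
  • Published: 2004.04.01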

Abstract

Most indexing methods proposed in earlier work assume low-dimensional data in dynamic environments. Recent database applications, however, require efficient processing of large volumes of high-dimensional data in static environments. Index construction strategies designed for low-dimensional, dynamic settings, particularly their data and space partitioning strategies, therefore adapt poorly to these new environments. In this study, we point out these facts and propose a new partitioning strategy for index construction that meets the requirements of the new applications and is derived from a performance model. Our approach applies packing, a technique for building indexes in static environments, and exploits observations on the Minkowski-sum cost model, which gives the expected query performance in high dimensions, under a uniform data distribution. These observations predict that, for high-dimensional data, partitioning data and space unevenly (unbalanced partitioning) may be more query-efficient than partitioning them evenly (balanced partitioning). Based on this prediction, we propose an unbalanced partitioning method called CSP (Cyclic Sliced Partitioning) together with its performance models. Analysis of the method yields explicit quantitative criteria for how to partition high-dimensional data unevenly. Using the cost model, simulations, and experiments, we show that our method outperforms the balanced strategy in high dimensions. Experimental comparisons with other indices and packing methods also show the superiority of our method.
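
The kind of observation the abstract refers to can be illustrated with a small calculation. The sketch below is not the CSP algorithm or the paper's exact cost model. It assumes uniformly distributed data in the unit cube, a grid-shaped partition, and a cubic window query of side q whose position is uniform among all placements that lie entirely inside the data space (a boundary-clipped variant of the Minkowski-sum estimate). The function names `hit_probability` and `expected_page_accesses` are hypothetical.

```python
def hit_probability(lo, hi, q):
    """Probability that a query interval [x, x + q] overlaps [lo, hi]
    when x is uniform on [0, 1 - q] (query stays inside the unit interval)."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must satisfy 0 <= q < 1")
    # The query overlaps the page interval iff lo - q <= x <= hi;
    # clip that range to the admissible positions [0, 1 - q].
    left = max(lo - q, 0.0)
    right = min(hi, 1.0 - q)
    return max(right - left, 0.0) / (1.0 - q)


def expected_page_accesses(cuts_per_dim, q):
    """Expected number of pages hit by a cubic query of side q.

    cuts_per_dim holds one sorted list of split boundaries per split
    dimension, e.g. [0.0, 0.5, 1.0]. Pages are the cells of the resulting
    grid. Dimensions that are never split contribute a factor of 1 and
    can be omitted.
    """
    total = 1.0
    for cuts in cuts_per_dim:
        # For a grid, the sum over all cells of the product of per-dimension
        # probabilities factorizes into a product of per-dimension sums.
        total *= sum(hit_probability(a, b, q) for a, b in zip(cuts, cuts[1:]))
    return total


if __name__ == "__main__":
    q = 0.6  # large query side, typical of high-dimensional range queries
    # Four equally full pages built in two different ways:
    balanced = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]   # halve two dimensions
    sliced = [[0.0, 0.25, 0.5, 0.75, 1.0]]          # four slices of one dimension
    print("balanced:", expected_page_accesses(balanced, q))
    print("sliced:  ", expected_page_accesses(sliced, q))
```

Under these assumptions, with q = 0.6 the midpoint splits prune nothing, because every query spans both halves of each split dimension and all four pages are read. Concentrating the splits on one dimension lowers the expected number of page reads to 3.25. This only shows why spreading splits evenly can be a poor choice when queries are large; the quantitative criteria for how to partition are those derived in the paper.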
