
An Improved Algorithm for Building Multi-dimensional Histograms with Overlapped Buckets  

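문진영 (Department of Computer Science, Korea Advanced Institute of Science and Technology)
심규석 (School of Electrical Engineering and Computer Science, Seoul National University)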
Abstract
Histograms are commonly used in commercial database systems to capture attribute value distributions for query optimization. More recently, research on approximate query answering and stream data has drawn wider interest in them. The simplest approach treats the attributes of a relational table as independent, under the AVI (Attribute Value Independence) assumption. However, this assumption does not generally hold for real-life datasets. To reduce the error that results from approximating multi-dimensional data with multiple one-dimensional histograms, several techniques have been proposed, including wavelets, random sampling, and multi-dimensional histograms. Among them, GENHIST is a multi-dimensional histogram designed to approximate the data distribution over real-valued attributes. It uses overlapping buckets, which allow a more efficient approximation of the data distribution. In this paper, we propose a scheme, OPT, that determines the frequencies of overlapping buckets that minimize the SSE (Sum Squared Error). A histogram with overlapping buckets is first generated by GENHIST, and OPT then improves it by calculating the optimal frequency for each bucket. Our experimental results confirm that OPT significantly improves the accuracy of histograms generated by GENHIST.
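The abstract does not spell out how OPT is formulated, so the following is only a minimal sketch of the general idea under stated assumptions: the data is a small 2-D grid of cell counts, each bucket is a rectangle that spreads its frequency uniformly over the cells it covers, and the estimate for a cell is the sum of the contributions of all buckets covering it. Under those assumptions, choosing frequencies that minimize the SSE is a linear least-squares problem. The helper names (`design_matrix`, `optimal_frequencies`, `sse`), the toy grid, and the "naive" baseline are all hypothetical and are not taken from the paper or from GENHIST.

```python
import numpy as np

def design_matrix(shape, buckets):
    """One row per grid cell, one column per bucket.

    An entry is 1/area if the bucket covers the cell and 0 otherwise, so a
    bucket spreads its frequency uniformly over its cells (assumption).
    Buckets are (row_start, row_end, col_start, col_end), half-open, non-empty.
    """
    rows, cols = shape
    A = np.zeros((rows * cols, len(buckets)))
    for j, (r0, r1, c0, c1) in enumerate(buckets):
        mask = np.zeros(shape)
        mask[r0:r1, c0:c1] = 1.0
        A[:, j] = mask.ravel() / mask.sum()
    return A

def sse(A, freqs, counts):
    """Sum of squared errors between true cell counts and the estimate."""
    residual = counts.ravel() - A @ freqs
    return float(residual @ residual)

def optimal_frequencies(A, counts):
    """Frequencies minimizing the SSE, via ordinary least squares."""
    freqs, *_ = np.linalg.lstsq(A, counts.ravel(), rcond=None)
    return freqs

# Toy data: a 4x4 grid with a dense 2x2 region in the middle.
counts = np.ones((4, 4))
counts[1:3, 1:3] = 3.0

# Two overlapping buckets: the whole grid and the dense inner region.
buckets = [(0, 4, 0, 4), (1, 3, 1, 3)]
A = design_matrix(counts.shape, buckets)

# Hypothetical baseline: give each bucket the raw count it contains.
# Cells in the overlap are then counted by both buckets.
naive = np.array([counts[r0:r1, c0:c1].sum() for r0, r1, c0, c1 in buckets])
best = optimal_frequencies(A, counts)

print("naive frequencies:", naive, "SSE:", sse(A, naive, counts))
print("least-squares frequencies:", np.round(best, 6), "SSE:", round(sse(A, best, counts), 6))
```

In this toy case the least-squares frequencies reproduce the grid exactly, while the baseline overestimates every cell because the overlap is counted twice. A real implementation would work on the buckets produced by GENHIST and on continuous attribute domains rather than a fixed grid.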
Keywords
Histograms