[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/KTCCS.2015.4.1.15

Effect of Sampling for Multi-set Cardinality Estimation

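Dao, DinhNguyen (Department of Computer and Information Engineering, Inha University)
Nyang, DaeHun (Department of Computer and Information Engineering, Inha University)
Lee, KyungHee (Department of Electrical Engineering, University of Suwon)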

Publication Information

KIPS Transactions on Computer and Communication Systems, v.4, no.1, 2015, pp. 15-22

Abstract

Estimating the number of distinct values is a well-known problem in network data measurement, and many effective algorithms have been proposed. Recent works build on a technique called Linear Counting to solve the estimation problem for massive sets or spreaders in small memory. Sampling is used to reduce the volume of measurement data, and it is generally assumed that sampling degrades accuracy. In this paper, however, we show that in multi-set estimation, CSE with sampling sometimes gives better results than MCSE, which examines every packet without sampling, in terms of both accuracy and estimation range. To support this claim, we present a mathematical analysis, conduct experiments with real data, and compare the results of CSE, MCSE, and CSES.
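
For readers unfamiliar with Linear Counting, the technique the abstract refers to, the following is a minimal Python sketch of the basic estimator: each item is hashed to one bit of an m-bit bitmap, and the distinct count is estimated as -m * ln(V), where V is the fraction of bits still zero. It is not the paper's CSE, MCSE, or CSES implementation; the function name, the bitmap size `m`, and the SHA-256-based hash are illustrative assumptions.

```python
import hashlib
import math


def linear_counting_estimate(items, m=1024):
    """Estimate the number of distinct items with an m-bit bitmap.

    Illustrative sketch only: m and the hash function are assumptions,
    not parameters taken from the paper.
    """
    bitmap = [0] * m
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % m
        bitmap[index] = 1  # duplicates map to the same bit, so they count once
    zeros = bitmap.count(0)
    if zeros == 0:
        # Bitmap is saturated: the estimator is undefined past this point.
        return float("inf")
    # Linear Counting estimator: n_hat = -m * ln(V), V = fraction of zero bits.
    return -m * math.log(zeros / m)


if __name__ == "__main__":
    # 200 distinct (hypothetical) destination addresses, each seen five times.
    packets = [f"10.0.0.{i}" for i in range(200)] * 5
    print(linear_counting_estimate(packets))
```

With 200 distinct items and a 1024-bit bitmap, the estimate should land close to 200 but will not be exact. The saturation branch illustrates why a fixed bitmap has a limited estimation range: once every bit is set, no finite estimate can be produced.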

Keywords

Traffic Measurement; Spreader; Estimation; Sampling


