[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5392/JKCA.2016.16.07.667

Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment

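Yang, Hyeon-Sik (Department of IT Information Engineering, Jeonbuk National University)
Jang, Miyoung (Department of Computer Engineering, Jeonbuk National University)
Chang, Jae-Woo (Department of IT Information Engineering, Jeonbuk National University)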

Publication Information

The Journal of the Korea Contents Association, v.16, no.7, 2016, pp. 667-680

Abstract

With the development of distributed computing platforms such as Hadoop MapReduce, conventional query processing techniques designed for a single machine need to be executed efficiently in distributed computing environments. In particular, similarity join query processing has been studied in this setting, where a similarity join retrieves all pairs of highly similar data items from two given data sets. However, existing similarity join schemes for distributed computing environments suffer from skewed computing load across clusters because they consider only the data transmission cost. In this paper, we propose a matrix-based filtering and load-balancing algorithm for efficient similarity join query processing in a distributed computing environment. To balance the load evenly across clusters, the proposed algorithm uses a matrix to estimate the expected computing cost and generates partitions based on that estimate. In addition, it reduces the computing load by filtering out data that are not needed for query processing. Finally, our performance evaluation shows that the proposed algorithm outperforms the existing scheme in query processing performance.
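The abstract does not give the algorithm's details, so the following is only a minimal single-process sketch of the general idea, not the authors' method. It assumes one-dimensional numeric data, a distance threshold `eps` as the similarity predicate, equal-width buckets as matrix rows and columns, the product of bucket sizes as the estimated cost of a cell, and a greedy largest-first assignment of cells to workers. All function names and parameters are hypothetical.

```python
import heapq
from collections import defaultdict


def bucket_of(value, lo, width, n_buckets):
    """Map a value to an equal-width bucket index, clamping the maximum value."""
    return min(int((value - lo) / width), n_buckets - 1)


def build_cost_matrix(r_values, s_values, eps, n_buckets):
    """Bucket both data sets and estimate the join cost of each matrix cell.

    Cells whose buckets are farther apart than eps cannot produce any
    result pair, so they are filtered out and never reach a worker.
    """
    lo = min(min(r_values), min(s_values))
    hi = max(max(r_values), max(s_values))
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width if all values are equal
    r_buckets, s_buckets = defaultdict(list), defaultdict(list)
    for v in r_values:
        r_buckets[bucket_of(v, lo, width, n_buckets)].append(v)
    for v in s_values:
        s_buckets[bucket_of(v, lo, width, n_buckets)].append(v)

    cells = {}
    for i, r_part in r_buckets.items():
        for j, s_part in s_buckets.items():
            # Smallest possible distance between a value in bucket i and one in bucket j.
            min_gap = max(0, abs(i - j) - 1) * width
            if min_gap <= eps:
                cells[(i, j)] = len(r_part) * len(s_part)  # estimated comparisons
    return cells, r_buckets, s_buckets


def assign_cells(cells, n_workers):
    """Greedy balancing: give the costliest remaining cell to the least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]
    assignment = {w: [] for w in range(n_workers)}
    for cell, cost in sorted(cells.items(), key=lambda kv: (-kv[1], kv[0])):
        load, w = heapq.heappop(heap)
        assignment[w].append(cell)
        heapq.heappush(heap, (load + cost, w))
    loads = {w: load for load, w in heap}
    return assignment, loads


def similarity_join(r_values, s_values, eps, n_buckets, n_workers):
    """Simulate the per-worker joins sequentially and collect all matching pairs."""
    cells, r_buckets, s_buckets = build_cost_matrix(r_values, s_values, eps, n_buckets)
    assignment, loads = assign_cells(cells, n_workers)
    result = set()
    for worker_cells in assignment.values():
        for i, j in worker_cells:
            result.update(
                (r, s) for r in r_buckets[i] for s in s_buckets[j] if abs(r - s) <= eps
            )
    return result, loads


if __name__ == "__main__":
    R = [1, 2, 3, 50, 51, 98]
    S = [2, 4, 49, 52, 100]
    pairs, loads = similarity_join(R, S, eps=2, n_buckets=10, n_workers=3)
    print(sorted(pairs))
    print(loads)
```

In this sketch the filtering step and the load-balancing step both read the same matrix: a cell is either dropped because it cannot contain a match, or kept with a cost estimate that drives the assignment. A real MapReduce implementation would instead emit cell identifiers as keys in the map phase and perform the per-cell joins in reducers.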

Keywords

Distributed Computing Environment; Big Data; Similarity Join Query Processing; Filtering; Load Balancing

