[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.9717/kmms.2013.16.5.657

Data Partitioning on MapReduce by Leveraging Data Utility

Kim, Jong Wook (Teradata)

Publication Information

Journal of Korea Multimedia Society / v.16, no.5, 2013, pp. 657-666

Abstract

Today, many aspects of our lives are characterized by the rapid influx of large amounts of data from various application domains. The applications that produce this massive amount of data span a large spectrum, from social media to business intelligence and biology. This influx necessitates large-scale parallelism to efficiently support a large class of analysis tasks. Recently, there have been extensive studies on using the MapReduce framework to support large-scale parallelism. While this technique has produced impressive results in diverse applications, the same cannot be said for multimedia applications, where most users are interested in only a small number of results with high or low scores. Thus, in this paper, we develop a data partitioning algorithm that efficiently processes large data sets whose items have different data utility. The experimental results show that the proposed technique provides significant execution time gains over the existing solution.

Keywords

MapReduce; Load Balance; Data Utility; Wasted Work

Citations & Related Records

Times Cited By KSCI: 1

References

1	Hive, http://wiki.apache.org/hadoop/Hive/, 2013.
2	G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's Highly Available Key-Value Store," Proc. the 21st ACM SIGOPS Symposium on Operating Systems Principles, pp. 205-220, 2007.
3	R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets," Proc. the International Conference on Very Large Data Bases, pp. 1265-1276, 2008.
4	B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "PNUTS: Yahoo!'s Hosted Data Serving Platform," Proc. the International Conference on Very Large Data Bases, pp. 1277-1288, 2008.
5	B. Panda, J.S. Herbach, S. Basu, and R.J. Bayardo, "PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce," Proc. the International Conference on Very Large Data Bases, pp. 1426-1437, 2009.
6	J. Lin, "Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce," Proc. the International ACM SIGIR Conference, pp. 155-162, 2009.
7	K.S. Candan, J.W. Kim, P. Nagarkar, M. Nagendra, and R. Yu, "RanKloud: Scalable Multimedia Data Processing in Server Clusters," IEEE MultiMedia, Vol. 18, Issue 1, pp. 64-77, 2011.
8	R. Yu, M. Nagendra, P. Nagarkar, K.S. Candan, and J.W. Kim, "Data-Utility Sensitive Query Processing on Server Clusters to Support Scalable Data Analysis Services," Lecture Notes in Business Information Processing, Vol. 74, pp. 155-184, 2011.
9	R. Ramakrishnan and J. Gehrke, "Database Management Systems," McGraw-Hill Higher Education, 2nd edition, Boston, MA, 2000.
10	Internet Movie Database, http://www.imdb.com/interfaces, 1990.
11	M. Zaharia, D. Borthakur, J.S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job Scheduling for Multi-User MapReduce Clusters," Technical Report No. UCB/EECS-2009-55, 2009.
12	S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi, "LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud," IEEE Second International Conference on Cloud Computing Technology and Science, pp. 17-24, 2010.
13	Y.C. Kwon, M. Balazinska, B. Howe, and J. Rolia, "Skew-resistant Parallel Processing of Feature-extracting Scientific User-defined Functions," Proc. ACM Sympo. Cloud Computing, pp. 75-86, 2010.
14	B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Handling Data Skew in MapReduce," Proc. Cloud Computing and Services Science, pp. 574-583, 2011.
15	J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Symposium on Operating Systems Design and Implementation, pp. 137-150, 2004.
16	Yahoo, Hadoop, http://hadoop.apache.org, 2013.
17	J.H. Kim and M. Kim, "A Filter Lining Scheme for Efficient Skyline Computation," Journal of Korea Multimedia Society, Vol. 14, No. 12, pp. 1591-1600, 2011.

Cited By KSCI

2	Jong Wook Kim, "Effective Indexing for Evolving Data Collection by Using Ontology," Journal of Korea Multimedia Society, 17 (2), 240, 2014.
8	Jong Wook Kim, "Efficient Top-K Queries Computation for Encrypted Data in the Cloud," Journal of Korea Multimedia Society, 18 (8), 915, 2015.
11	"Development of a Privacy-Preserving Big Data Publishing System Based on a Hadoop Distributed Environment," Journal of Korea Multimedia Society, 20 (11), 1785, 2013.
