http://dx.doi.org/10.9717/kmms.2013.16.5.657

Data Partitioning on MapReduce by Leveraging Data Utility  

Kim, Jong Wook (Teradata)
Abstract
Today, many aspects of our lives are characterized by the rapid influx of large amounts of data from various application domains. The applications that produce this massive amount of data span a large spectrum, from social media to business intelligence and biology. This influx of data necessitates large-scale parallelism to efficiently support a large class of analysis tasks. Recently, there have been extensive studies on using the MapReduce framework to support large-scale parallelism. While this technique has produced impressive results in diverse applications, the same cannot be said for multimedia applications, where most users are interested in only a small number of results with high or low scores. Thus, in this paper, we develop a data partitioning algorithm that can efficiently process large data sets whose items have different data utility. The experimental results show that the proposed technique provides significant execution time gains over the existing solution.
Keywords
MapReduce; Load Balance; Data Utility; Wasted Work