Efficient Processing of Multiple Group-by Queries in MapReduce for Big Data Analysis

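  • 박은주 (Division of Computer Science, Sookmyung Women's University) ;
  • 박소정 (Division of Computer Science, Sookmyung Women's University) ;
  • 오소현 (Division of Computer Science, Sookmyung Women's University) ;
  • 최혜진 (Division of Computer Science, Sookmyung Women's University) ;
  • 이기용 (Division of Computer Science, Sookmyung Women's University) ;
  • 심준호 (Division of Computer Science, Sookmyung Women's University)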
  • Received : 2015.03.06
  • Accepted : 2015.04.01
  • Published : 2015.05.15

Abstract

MapReduce is a framework for processing large data sets in parallel on a large cluster of machines. A group-by query partitions the input data into groups based on the values of specified attributes and then evaluates a specified aggregate function for each group. In this paper, we propose an efficient method for processing two or more group-by queries that are requested at the same time using MapReduce. Instead of computing each group-by query independently, the proposed method computes the queries in stages, using one or more MapReduce jobs, to reduce the total execution cost. We compared the proposed method with a naive method that computes each group-by query independently, and the comparison showed that the proposed method achieves shorter execution times.

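
To make the terms in the abstract concrete, the following is a minimal single-machine sketch in Python, not the method proposed in the paper. It shows what a group-by query with a SUM aggregate computes, and one plausible reading of "computing in stages": a finer grouping is computed once from the raw input, and coarser groupings are then derived from that smaller intermediate result. The data, the column layout `(region, product, amount)`, and the helper `group_by_sum` are all hypothetical, and the sketch assumes a distributive aggregate such as SUM, for which partial results can be combined.

```python
from collections import defaultdict


def group_by_sum(rows, key_idx, value_idx):
    """Partition rows by the attributes at key_idx and sum the value column per group."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_idx)
        totals[key] += row[value_idx]
    return dict(totals)


# Hypothetical input relation: (region, product, amount).
rows = [
    ("east", "pen", 3),
    ("east", "ink", 5),
    ("west", "pen", 2),
    ("east", "pen", 4),
]

# Stage 1: a single pass over the raw input computes the finer grouping.
by_region_product = group_by_sum(rows, key_idx=(0, 1), value_idx=2)

# Stage 2: the coarser groupings read the stage-1 result instead of the raw input.
stage1_rows = [
    (region, product, total)
    for (region, product), total in by_region_product.items()
]
by_region = group_by_sum(stage1_rows, key_idx=(0,), value_idx=2)
by_product = group_by_sum(stage1_rows, key_idx=(1,), value_idx=2)

print(by_region)   # totals per region
print(by_product)  # totals per product
```

In this sketch the stage-2 results match what independent passes over `rows` would produce, while each stage-2 pass reads three intermediate rows instead of four raw ones. On large inputs the intermediate result can be much smaller than the raw data, which is the kind of saving a staged plan aims for.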

Acknowledgement

Supported by: National IT Industry Promotion Agency (NIPA)
