An Empirical Evaluation Analysis of the Performance of In-memory Bigdata Processing Platform

A Study on Improving the Performance of a Memory-based Big Data Processing Framework

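  • 이재환 (Korea Aerospace University, School of Electronics and Information Engineering) ;
  • 최준 (Korea Aerospace University, School of Electronics and Information Engineering) ;
  • 구동훈 (Korea Aerospace University, School of Electronics and Information Engineering)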
  • Received : 2016.04.19
  • Accepted : 2016.06.20
  • Published : 2016.06.30

Abstract

Spark, an in-memory big-data processing framework, is widely used for real-time processing workloads. Spark can keep all intermediate data in cluster memory, which minimizes I/O access. However, when the resident memory of a workload is larger than the physical memory of the cluster, overall performance can drop dramatically. In this paper, we experimentally analyze the bottleneck factors in a PageRank application, which is memory-intensive, and configure Spark together with the Tachyon file system so that memory is used more efficiently. This resolves the bottleneck and improves performance by about 18%.

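Recently, Spark, a memory-based big-data processing framework, has come into wide use for real-time processing. Spark keeps all of the intermediate data a program needs in memory and minimizes I/O, which allows fast responses. However, when an application's memory usage exceeds the actual memory of the cluster, optimal performance is hard to expect. In this paper, we experimentally analyze the causes of the bottleneck that arises in a memory-intensive PageRank application, and by configuring Tachyon together with Spark to use memory efficiently, we resolve those causes and achieve an 18% performance improvement.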
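
To illustrate why PageRank is a memory-intensive workload, the following is a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation. The function name `pagerank`, its parameters, and the simplified update rule `(1 - damping) + damping * contributions` are assumptions chosen for illustration. The point it shows is that every iteration rebuilds a full set of per-node contributions, which is the kind of intermediate data an in-memory framework would try to hold in cluster memory.

```python
from collections import defaultdict


def pagerank(edges, iterations=10, damping=0.85):
    """Iterative PageRank over a list of (source, destination) edges.

    Illustrative sketch only: assumes every node has at least one
    outgoing link and uses a simplified, non-normalized update rule.
    """
    links = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        links[src].append(dst)
        nodes.update((src, dst))

    # Start every node with rank 1.0.
    ranks = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Intermediate data: contributions are recomputed for every
        # edge in every iteration, so they grow with the graph size.
        contribs = defaultdict(float)
        for src, dsts in links.items():
            share = ranks[src] / len(dsts)
            for dst in dsts:
                contribs[dst] += share

        # Combine the damping term with the received contributions.
        ranks = {
            node: (1 - damping) + damping * contribs[node]
            for node in nodes
        }
    return ranks


if __name__ == "__main__":
    sample_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    for node, rank in sorted(pagerank(sample_edges).items()):
        print(node, round(rank, 4))
```

On a graph with millions of edges, both the link structure and the per-iteration contributions must stay available across iterations, which is where the gap between workload memory and physical cluster memory described in the abstract would start to matter.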
