Anomaly Detection of Hadoop Log Data Using Moving Average and 3-Sigma

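  • 손시운 (Department of Computer Science, Kangwon National University) ;
  • 길명선 (Department of Computer Science, Kangwon National University) ;
  • 문양세 (Department of Computer Science, Kangwon National University) ;
  • 원희선 (Electronics and Telecommunications Research Institute)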
  • Received : 2016.05.17
  • Accepted : 2016.05.30
  • Published : 2016.06.30

Abstract

In recent years, Big Data has attracted considerable research effort, and many companies have developed related products. As a result, it is now possible to store and analyze large volumes of log data that were difficult to handle in traditional computing environments. In this paper, we design a data storage architecture that allows large volumes of log data, generated rapidly across multiple servers, to be analyzed efficiently with Apache Hive. We then design and implement anomaly detection methods, based on moving average and 3-sigma techniques, that identify abnormal server status from the stored log data. Through experiments, we show that the proposed methods correctly identify as anomalies the intervals in which the volume of log data increases sharply. These results indicate that the proposed approach is effective for detecting anomalies in Hadoop log data.
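The abstract names the two techniques but does not give their exact formulation, so the following is only a minimal sketch of one common way to combine them. It assumes that the log data has already been aggregated into a count per time interval, that the moving average is taken over the `window` values preceding each point, and that a point is anomalous when it lies more than three standard deviations from that average. The function name `detect_anomalies`, the parameters `window` and `k`, and the sample counts are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def detect_anomalies(counts, window=5, k=3.0):
    """Return the indices of points that deviate from the trailing
    moving average by more than k standard deviations.

    Assumption: `counts` holds one log-record count per time interval.
    """
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]  # the `window` values before point i
        mu = mean(history)              # moving average of the history
        sigma = pstdev(history)         # population standard deviation
        # 3-sigma rule (k = 3 by default). If the history is constant,
        # sigma is 0 and any change at all is flagged.
        if abs(counts[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical per-interval log counts with one sharp spike at index 7.
counts = [100, 102, 98, 101, 99, 100, 103, 400, 101, 99]
print(detect_anomalies(counts))  # prints [7]
```

In this sketch the spike itself enters the history of the following points and inflates their standard deviation, so a second anomaly shortly after the first may be missed. Excluding flagged points from the history is one possible refinement.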
