http://dx.doi.org/10.3745/KTSDE.2016.5.6.283

Anomaly Detection of Hadoop Log Data Using Moving Average and 3-Sigma  

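Son, Siwoon (Department of Computer Science, Kangwon National University)
Gil, Myeong-Seon (Department of Computer Science, Kangwon National University)
Moon, Yang-Sae (Department of Computer Science, Kangwon National University)
Won, Hee-Sun (Electronics and Telecommunications Research Institute)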
Publication Information
KIPS Transactions on Software and Data Engineering / v.5, no.6, 2016, pp. 283-288
Abstract
In recent years, there has been considerable research on Big Data, and many companies have developed a variety of related products. As a result, it is now possible to store and analyze large volumes of log data that were difficult to handle in traditional computing environments. To handle the large volume of log data generated rapidly across multiple servers, in this paper we design a new data storage architecture that supports efficient analysis of such log data through Apache Hive. We then design and implement anomaly detection methods, based on moving average and 3-sigma techniques, that identify abnormal server status from log data. We also demonstrate the effectiveness of the proposed methods by showing that they identify anomalies correctly. These results indicate that our approach is effective for detecting anomalies in Hadoop log data.
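To illustrate the general idea of combining a moving average with the 3-sigma rule, the following is a minimal Python sketch. It is not the authors' implementation, which the abstract describes as built on Apache Hive. The function name `detect_anomalies`, the window size, and the sample values are hypothetical and chosen only for illustration. Each value is compared with the mean and standard deviation of the preceding window and flagged if it deviates by more than three standard deviations.

```python
from statistics import mean, pstdev


def detect_anomalies(values, window=5, k=3.0):
    """Return indices of values that deviate from the moving average
    of the preceding `window` values by more than k standard deviations.

    Hypothetical helper for illustration; not taken from the paper.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]   # the preceding window only
        mu = mean(history)               # moving average
        sigma = pstdev(history)          # population standard deviation
        if abs(values[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


# Made-up metric samples (e.g. a per-minute server measurement).
samples = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
print(detect_anomalies(samples, window=5))  # prints [6]
```

In this sketch the spike at index 6 is flagged, while the following values are not, because the spike inflates the standard deviation of the windows that contain it. That masking effect is one reason window size and threshold are usually tuned per metric.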
Keywords
Big Data; Apache Hadoop; Apache Hive; Log Data; Anomaly Detection