http://dx.doi.org/10.7472/jksii.2017.18.1.27

Outlier Detection Based on MapReduce for Analyzing Big Data  

Hong, Yejin (Department of Information and Communication Engineering, Dongguk University)
Na, Eunhee (Department of Information and Communication Engineering, Dongguk University)
Jung, Yonghwan (Korea Institute of Science and Technology Information, Korea Advanced Institute of Science)
Kim, Yangwoo (Department of Information and Communication Engineering, Dongguk University)
Publication Information
Journal of Internet Computing and Services / v.18, no.1, 2017, pp. 27-35
Abstract
In the near future, IoT data is expected to make up a major portion of Big Data, and sensor data is in turn expected to make up a major portion of IoT data; research on it is being actively carried out. However, if outliers are included when sensor data is processed, the results cannot be trusted or used. This paper therefore studies a method for detecting and deleting such outliers before processing. For fast processing of large volumes of sensor data, we use Spark, a memory-based distributed processing environment. Outlier detection and deletion consist of four stages, each implemented with Mapper and Reducer operations. The proposed method is compared across three different processing environments, and the distributed Spark environment is expected to give the best outlier detection and deletion performance as data volume increases.
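The abstract does not spell out the four stages, so the following is only a minimal sketch of the general idea of detecting and deleting outliers with Mapper and Reducer operations, not the authors' pipeline. It assumes readings arrive as `(sensor_id, value)` pairs and uses a simple "more than `k` standard deviations from the sensor's mean" rule; the function names, the threshold `k`, and the tolerance `tol` are all hypothetical. Plain Python stands in for Spark so the example is self-contained.

```python
from collections import defaultdict
from functools import reduce
from math import sqrt


def map_stats(record):
    """Mapper: emit (sensor_id, (count, sum, sum of squares)) for one reading."""
    sensor_id, value = record
    return sensor_id, (1, value, value * value)


def reduce_stats(a, b):
    """Reducer: combine two partial (count, sum, sum of squares) tuples."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def group_by_key(pairs):
    """Shuffle step: collect mapper output per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def remove_outliers(records, k=2.0, tol=1e-9):
    """Keep readings within k standard deviations of their sensor's mean.

    The k-sigma rule and the tolerance are assumptions made for this sketch.
    """
    grouped = group_by_key(map(map_stats, records))
    stats = {}
    for sensor_id, partials in grouped.items():
        n, total, total_sq = reduce(reduce_stats, partials)
        mean = total / n
        # Population variance; clamp at 0 to absorb rounding error.
        std = sqrt(max(total_sq / n - mean * mean, 0.0))
        stats[sensor_id] = (mean, std)
    # Second pass: filter each reading against its sensor's statistics.
    return [
        (sid, v)
        for sid, v in records
        if abs(v - stats[sid][0]) <= k * stats[sid][1] + tol
    ]


readings = [
    ("t1", 20.0), ("t1", 21.0), ("t1", 19.0),
    ("t1", 20.5), ("t1", 19.5), ("t1", 90.0),
    ("t2", 5.0),
]
cleaned = remove_outliers(readings)
```

Because the reducer only adds tuples, it is associative and commutative, which is what lets a distributed engine combine partial results from different partitions in any order.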
Keywords
Big Data; Outlier; MapReduce; Distributed Processing; Spark;
Citations & Related Records
Times Cited By KSCI: 1
1 Hewlett Packard Enterprise, "Internet of things research study", report, pp. 1-3, 2015.
2 Zhang, Yang, Nirvana Meratnia, and Paul Havinga. "Outlier detection techniques for wireless sensor networks: A survey." Communications Surveys & Tutorials, IEEE 12.2, pp.159-170, 2010. http://ieeexplore.ieee.org/document/5451757/
3 Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1, pp.107-113, 2008. http://dl.acm.org/citation.cfm?id=J79
4 Shvachko, Konstantin, et al. "The Hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, pp.1-10, 2010. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5488875
5 Zaharia, Matei, et al. "Spark: Cluster computing with working sets", HotCloud 10, pp.10-10, 2010.
6 David Culler, Michael Franklin (Director), Amplab UC BERKELEY, BDAS, the Berkeley Data Analytics Stack, 2014 https://amplab.cs.berkeley.edu/software/
7 Murphy, Kevin P, "Machine learning: a probabilistic perspective" MIT press, 2012.
8 Davies, Paul L. "Statistical evaluation of interlaboratory tests." Fresenius' Zeitschrift für analytische Chemie 331.5, pp.513-519, 1988.
9 Knorr, Edwin M., and Raymond T. Ng. "Finding intensional knowledge of distance-based outliers." VLDB, Vol.99, pp.211-222, 1999.
10 Zhuang, Yongzhen, et al. "A weighted moving average-based approach for cleaning sensor data." Distributed Computing Systems, ICDCS'07. 27th International Conference on. IEEE, pp.38-38, 2007. http://ieeexplore.ieee.org/document/4268192/
11 Apache Mesos, The Apache software foundation, http://mesos.apache.org, 2012-2015.
12 Zaharia, Matei, et al. "Fast and interactive analytics over Hadoop data with Spark." USENIX ;login: 37.4, pp.45-51, 2012.
13 Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, pp.2-2, 2012.
14 Changyong Park and Youngeun Choi. "Validation of quality control algorithms for temperature data of the Republic of Korea." Vol.22 No.3, pp.299-307, 2012. https://www.researchgate.net/publication/264105200_Validation_of_Quality_Control_Algorithms_for_Temperature_Data_of_the_Republic_of_Korea
15 Aitchison, John, and Ian Robert Dunsmore, "Statistical prediction analysis", CUP Archive, 1980.
16 Preparata, Franco P., and Michael Shamos, "Computational geometry: an introduction", Springer Science & Business Media, 2012.
17 Open Data Portal, "Environment & Weather", https://www.data.go.kr
18 Powers, David Martin. "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation." Journal of Machine Learning Technologies, 2011.
19 Brown, Angus M. "A step-by-step guide to nonlinear regression analysis of experimental data using a Microsoft Excel spreadsheet." Computer methods and programs in biomedicine 65.3, pp.191-200, 2001. http://www.sciencedirect.com/science/article/pii/S0169260700001243