Outlier Detection Based on MapReduce for Analyzing Big Data

MapReduce-Based Outlier Detection for Large-Scale Data Analysis

  • Hong, Yejin (Department of Information and Communication Engineering, Dongguk University)
  • Na, Eunhee (Department of Information and Communication Engineering, Dongguk University)
  • Jung, Yonghwan (Korea Institute of Science and Technology Information, Korea Advanced Institute of Science)
  • Kim, Yangwoo (Department of Information and Communication Engineering, Dongguk University)
  • Received : 2016.10.15
  • Accepted : 2016.11.30
  • Published : 2017.02.28

Abstract

In the near future, IoT data is expected to make up a major portion of Big Data. Sensor data, in turn, is expected to make up a major portion of IoT data, and it is currently the subject of active research. However, if outliers are included when sensor data is processed, the results may be distorted and cannot be trusted or used. This paper therefore studies a method for detecting and removing outliers before the data is processed. To process large volumes of sensor data quickly, we use Spark, a memory-based distributed processing environment. Outlier detection and removal consists of four stages, each implemented with Mapper and Reducer operations. The proposed method is compared in three different processing environments, and the results show that, as the data volume grows, outlier detection and removal is fastest in the distributed Spark environment.

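Forecasts suggest that, in the near future, IoT data will account for a large share of big data. Accordingly, interest in and research on sensor data, which accounts for a large share of IoT data, are also active. Sensor data is used in many fields, but when it contains outliers, that is, values that differ from the actual ones, accurate analysis becomes difficult and the distorted results may be unusable. In this paper, therefore, outliers were detected and removed before the collected raw data was analyzed, in order to obtain accurate results. In addition, to process ever-growing volumes of data quickly, processing was carried out in a distributed environment using Spark, which takes an in-memory approach. MapReduce-based outlier detection and removal was implemented in four stages, each realized as a mapper and a reducer. The proposed technique was evaluated by comparing three environments, and the results show that the larger the volume of data to be cleaned of outliers, the more clearly the distributed Spark environment is the fastest.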
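
The abstract does not spell out the four stages or the detection rule, so the following is only a minimal sketch of what one mapper/reducer pass for outlier removal could look like. Everything in it is an assumption made for illustration: the `(sensor_id, value)` record layout, the function names, the mean and standard-deviation rule, and the threshold `k`. Plain Python `map` and `functools.reduce` stand in for the Spark operations.

```python
from functools import reduce
from math import sqrt


def mapper(record):
    """Map a (sensor_id, value) reading to partial statistics (hypothetical layout)."""
    sensor_id, value = record
    return sensor_id, (1, value, value * value)


def reducer(acc, pair):
    """Merge partial (count, sum, sum of squares) per sensor into the accumulator."""
    sensor_id, (n, s, sq) = pair
    n0, s0, sq0 = acc.get(sensor_id, (0, 0.0, 0.0))
    acc[sensor_id] = (n0 + n, s0 + s, sq0 + sq)
    return acc


def remove_outliers(records, k=2.0):
    """Drop readings more than k standard deviations from their sensor's mean.

    The k-sigma rule is an assumed stand-in, not the rule used in the paper.
    """
    stats = reduce(reducer, map(mapper, records), {})
    kept = []
    for sensor_id, value in records:
        n, s, sq = stats[sensor_id]
        mean = s / n
        # Population standard deviation; clamp tiny negative rounding errors.
        std = sqrt(max(sq / n - mean * mean, 0.0))
        if abs(value - mean) <= k * std:
            kept.append((sensor_id, value))
    return kept


# Made-up readings: six plausible temperatures and one spike.
readings = [("s1", 20.0), ("s1", 21.0), ("s1", 19.0), ("s1", 20.0),
            ("s1", 20.5), ("s1", 19.5), ("s1", 95.0)]
cleaned = remove_outliers(readings)
print(cleaned)
```

With these made-up readings the spike at 95.0 is dropped and the six remaining values are kept. In a distributed setting the same mapper and reducer shapes would run per partition, which is where an in-memory engine such as Spark is expected to help as data volume grows.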
