Standardizing Unstructured Big Data and Visual Interpretation using MapReduce and Correspondence Analysis

  • Choi, Joseph (Department of Statistics, Pusan National University) ;
  • Choi, Yong-Seok (Department of Statistics, Pusan National University)
  • Received : 2013.10.16
  • Accepted : 2014.02.17
  • Published : 2014.04.30

Abstract

Massive and varied data recorded everywhere are called big data. It is therefore important to analyze big data and find valuable information in it. Standardizing unstructured big data is also important, because it allows statistical methods to be applied. In this paper, we show how to standardize unstructured big data using MapReduce, a distributed processing system. We then apply simple correspondence analysis and multiple correspondence analysis to articles from The Korea Economic Daily newspaper to find the relationships and characteristics of the words directly related to Samsung Electronics and to Apple Inc.

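Today, big data of many forms are accumulating in a wide range of fields. Accordingly, analyzing big data and extracting valuable information from them is becoming very important. It is likewise becoming very important to convert unstructured big data into a structured form so that statistical techniques can be applied. In this study, we structure unstructured big data using MapReduce, a distributed processing system, and apply the statistical techniques of simple and multiple correspondence analysis to articles printed in The Korea Economic Daily to identify the relationships and characteristics of the words that mention Samsung Electronics and Apple, respectively.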
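As a rough illustration of the pipeline the abstract describes, the following minimal sketch mimics the map, shuffle, and reduce steps in plain Python to turn free text into a company-by-word contingency table, and then computes the total inertia (chi-square statistic divided by the grand total), the quantity that correspondence analysis decomposes. It is not the authors' implementation: the toy corpus, the whitespace tokenizer, and all function names are assumptions made for this example, and a real job would run on a distributed MapReduce framework rather than in a single process.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy corpus of (company label, article text) pairs.
# These are illustrative strings, not data from the paper.
ARTICLES = [
    ("Samsung", "galaxy smartphone patent"),
    ("Samsung", "galaxy semiconductor"),
    ("Apple", "iphone smartphone patent"),
    ("Apple", "iphone patent"),
]


def map_phase(record):
    """Map: emit ((company, word), 1) for every word in one article."""
    company, text = record
    return [((company, word), 1) for word in text.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each (company, word) key."""
    return {key: sum(values) for key, values in groups.items()}


def contingency_table(counts):
    """Arrange the reduced counts as a rows-by-columns frequency table."""
    rows = sorted({company for company, _ in counts})
    cols = sorted({word for _, word in counts})
    table = [[counts.get((r, c), 0) for c in cols] for r in rows]
    return rows, cols, table


def total_inertia(table):
    """Total inertia of the table: Pearson chi-square divided by n."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n


# Run the three phases in sequence on the toy corpus.
emitted = chain.from_iterable(map(map_phase, ARTICLES))
counts = reduce_phase(shuffle(emitted))
rows, cols, table = contingency_table(counts)

print(cols)
for label, row in zip(rows, table):
    print(label, row)
print("total inertia:", total_inertia(table))
```

The resulting table is the structured input that simple correspondence analysis works on. Obtaining the plotted row and column coordinates would additionally require a singular value decomposition of the standardized residuals, which this sketch leaves out.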
