An Algorithm for Tournament-based Big Data Analysis

  • Lee, Hyunjin (Dept. of Computer Science & Software, Korea Soongsil Cyber University)
  • Received : 2015.07.21
  • Accepted : 2015.08.28
  • Published : 2015.09.30

Abstract

Although all data has inherent value, most data collected in the real world is random and unstructured. Extracting useful information from such data requires data transformation and analysis algorithms, and data mining serves this purpose. Today, analysis demands not only a variety of data mining techniques but also the computational capacity and short analysis time needed to process huge volumes of data efficiently. Hadoop is commonly used to store such large volumes of data, and the MapReduce framework is used to analyze data stored in Hadoop. In this paper, we propose a tournament-based method that makes it more efficient to port an algorithm developed for a single machine to the MapReduce framework. The proposed method can be applied to many analysis algorithms, and we demonstrate its usefulness by applying it to two widely used data mining algorithms, k-means and k-nearest neighbor classification.
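
The abstract does not spell out how the tournament is organised. As one possible reading, the sketch below treats each data partition as a contestant: a map step computes a local k-nearest-neighbor candidate list per partition, and the partial results are then merged pairwise, round by round, until a single list remains. The function names, the data layout, and the pairwise merge rule are illustrative assumptions rather than the paper's implementation, and the code runs on a single machine instead of Hadoop.

```python
import heapq
from collections import Counter
from math import dist


def local_knn(partition, query, k):
    """Map step (assumed): the k nearest labelled points within one partition."""
    scored = [(dist(point, query), label) for point, label in partition]
    return heapq.nsmallest(k, scored)


def merge_pair(left, right, k):
    """One 'match': keep the k closest candidates from two partial results."""
    return heapq.nsmallest(k, left + right)


def tournament_reduce(results, k):
    """Merge partial results pairwise, round by round, until one remains.

    Assumes at least one partial result is supplied.
    """
    while len(results) > 1:
        next_round = [
            merge_pair(results[i], results[i + 1], k)
            for i in range(0, len(results) - 1, 2)
        ]
        if len(results) % 2 == 1:
            # An unpaired result advances to the next round unchanged.
            next_round.append(results[-1])
        results = next_round
    return results[0]


def classify(partitions, query, k):
    """Majority vote over the k nearest neighbors found by the tournament."""
    partials = [local_knn(p, query, k) for p in partitions]
    candidates = tournament_reduce(partials, k)
    votes = Counter(label for _, label in candidates)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: three partitions of (point, label) pairs.
    partitions = [
        [((0.0, 0.0), "a"), ((1.0, 1.0), "a")],
        [((5.0, 5.0), "b"), ((0.5, 0.5), "a")],
        [((6.0, 6.0), "b")],
    ]
    print(classify(partitions, (0.2, 0.2), k=3))
```

Because each match keeps only the k best candidates, every merge handles at most 2k items regardless of partition size, and the final list equals the one a single-machine search over all points would produce.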
